Scraping is a web mining task that is usually done using some kind of scripting language such as python. However, since `minet` is first and foremost a CLI tool, we need to find another way to declare our scraping intentions. So, in order to enable efficient scraping, `minet` uses a custom declarative DSL (Domain Specific Language), typically written in JSON or YAML and reminiscent of artoo.js scraping functions.

The present document is therefore a full tutorial about the capabilities of this DSL.

Also, if you ever need to read some real examples of scrapers written using the DSL, you can check out this folder of the repository.

Finally, this tutorial is fairly long, so I encourage you to bail out as soon as you deem you know enough to fit your use case.
- Applying a minet scraper
- Scraping a single item
- Declaring what to extract
- Declaring multiple fields to extract
- Iterating over multiple nodes
- Subselection
- Item-level subselection
- Nesting data
- Default values
- Filtering
- Keeping only unique items
- Using evaluation when declarative is not enough
- Accessing global context
- Defining local context
- Aliases
- Scraper execution outline
- Various useful examples
Before learning how to write a `minet` scraper on your own, here is what a scraper extracting a page's `<title>` tag content would look like in YAML format:

```yaml
---
sel: title
```
But how would one actually use it?
From python, you will need to use the `minet.Scraper` class to build your scraper utility like so:

```python
from minet import Scraper

# Creating a scraper from a yaml declaration file:
scraper = Scraper('scraper.yml')

# Creating a scraper directly from a declaration variable:
declaration = {'sel': 'title'}
scraper = Scraper(declaration)

# You can now apply your scraper to arbitrary html thusly:
title = scraper(html)
```
For more information about `minet.Scraper`, be sure to read the relevant documentation here.
To use a scraper from the command line, you can use the `scrape` command thusly:

```bash
# To scrape html files recorded in a minet fetch report:
minet scrape scraper.yml -i report.csv > scraped.csv

# To scrape html files found by using a glob pattern:
minet scrape scraper.yml 'files/**/*.html' --glob > scraped.csv
```
If you want to learn about glob patterns, this wikipedia page can teach you how to build them. In our example, `files/**/*.html` basically means we are searching for any file having the `.html` extension in any recursive subfolder found under `files/`.
For more information about the `minet scrape` command, be sure to read the relevant documentation here.
Let's start easy and imagine we only need to scrape a single thing from our HTML page: its title. The title is usually the text contained in the `<title>` tag:

```html
<html>
  <head>
    <title>
      My Awesome Web Page
    </title>
  </head>
  <body>
    <div id="main" data-message="hello">Hello World!</div>
  </body>
</html>
```
So, here is the simplest scraper ever:
```yaml
---
sel: title # Here we use CSS selection to target the first <title> tag
```
If you are a bit shaky about CSS selection and want to refresh your memory or simply learn, I can't recommend this wonderful tutorial enough.
Note that `minet` uses the popular BeautifulSoup library to parse and traverse html documents and relies on the soupsieve library for CSS selection.

This said, applying this scraper to the aforementioned HTML will yield the following result:

```json
"My Awesome Web Page"
```
Two things should be noted here:

- In the absence of any additional information, the scraper will always fall back to extracting the node's text if not instructed otherwise.
- Extracted text is always stripped of its leading and trailing whitespace for convenience.
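To make these two points concrete, here is a minimal python sketch (assuming `minet` is installed; the html string is made up for the example):

```python
from minet import Scraper

# With no `item` key, the scraper falls back to the node's text,
# stripped of its leading and trailing whitespace:
html = '<html><head><title>  My Awesome Web Page  </title></head></html>'

scraper = Scraper({'sel': 'title'})

assert scraper(html) == 'My Awesome Web Page'
```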
But maybe we want something different than a selected node's text. What if we want its inner html? Or one of its attributes? Then we need to say so using the `item` key:

```yaml
---
sel: body
item: html # Here, we want the node's inner html
```
This will yield:
"<div id=\"main\" class=\"container\">Hello World!</div>"
Note that the first example we saw is actually a shorthand for:
```yaml
---
sel: title
item: text
```
But what if we want to extract a node's attribute? Well, it's as simple as:

```yaml
---
# Note that here I wrap the selector in quotes so that YAML
# does not interpret this as a comment
sel: '#main'
item: data-message
```
This will naturally yield:
"hello"
So, to recap, here is what `item` can be:

- `text` to extract the selected node's text (this is the same as not declaring an `item`)
- `display_text` to extract the selected node's text while attempting to render its whitespace the same way a browser would (for more information check this)
- `html` or `inner_html` to extract the selected node's inner html
- `outer_html` to extract the selected node's outer html
- the name of any of the selected node's attributes
Now we might want to extract several pieces of information from our selected node rather than a single one. To do so, we can use the `fields` key, instead of `item`, to structure the output:

```yaml
---
sel: '#main'
fields: # Here we declare a dictionary of fields to output
  content: text # Each key is now its own `item`
  message: data-message
```
This will yield:
```json
{
  "content": "Hello World!",
  "message": "hello"
}
```
So far, we now know how to select a single node and extract information about it. But oftentimes, we want to iterate over a selection of nodes and extract information about each of them.
Let's consider the following html:
```html
<main>
  <article>
    <h2>
      <a href="http://howtoscrape.com">
        How to scrape?
      </a>
    </h2>
    <div>
      Posted by <strong>George</strong>
    </div>
  </article>
  <article>
    <h2>
      <a href="http://howtocrawl.co.uk">
        How to crawl?
      </a>
    </h2>
    <div>
      Posted by <strong>Mary</strong>
    </div>
  </article>
</main>
```
To be able to iterate over the page's articles, we will use the `iterator` key to select multiple nodes from which to extract information:

```yaml
---
iterator: article a # This is still CSS selection
```
This will yield an array containing one item per selected node:
```json
[
  "How to scrape?",
  "How to crawl?"
]
```
You can of course still use the `item` and `fields` keys to declare finely what to extract:

```yaml
---
iterator: article a
item: href
```
will yield:
```json
[
  "http://howtoscrape.com",
  "http://howtocrawl.co.uk"
]
```
And this scraper:

```yaml
---
iterator: article a
fields:
  title: text
  url: href
```
will yield:
```json
[
  {
    "title": "How to scrape?",
    "url": "http://howtoscrape.com"
  },
  {
    "title": "How to crawl?",
    "url": "http://howtocrawl.co.uk"
  }
]
```
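If you want to check those results yourself, here is a runnable sketch of the same scraper, built from a python declaration (assuming `minet` is installed):

```python
from minet import Scraper

html = """
<main>
  <article><h2><a href="http://howtoscrape.com">How to scrape?</a></h2></article>
  <article><h2><a href="http://howtocrawl.co.uk">How to crawl?</a></h2></article>
</main>
"""

scraper = Scraper({
    'iterator': 'article a',
    'fields': {
        'title': 'text',
        'url': 'href'
    }
})

for item in scraper(html):
    print(item['title'], '->', item['url'])
# How to scrape? -> http://howtoscrape.com
# How to crawl? -> http://howtocrawl.co.uk
```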
Sometimes, it might be useful to be able to perform subselections. For instance, if we consider the following html:
<ul id="fruits">
<li color="red">Apple</li>
<li color="orange">Orange</li>
</ul>
<ul id="days">
<li>Monday</li>
<li>Saturday</li>
</ul>
and if we want to scrape only the fruits, we could write it thusly:
```yaml
---
iterator: '#fruits li'
```
or
```yaml
---
sel: '#fruits'
iterator: li
```
and both would yield the same:
```json
[
  "Apple",
  "Orange"
]
```
This basically means that `sel` is applied before the `iterator` selection, which can be useful if you want to separate concerns.
Now it might also be useful to define subselections at the `item` or `fields` level. But before being able to do so, we need to understand that `item` can also be defined as an object rather than a simple string, for when things need to be more complex:

```yaml
---
iterator: '#fruits li'
item:
  extract: text # this is the same as `item: text` or no `item` at all
---
iterator: '#fruits li'
item:
  attr: color # this is the same as `item: color` but with less ambiguity
```
So now we've learned the following subkeys:

- `extract` can be `text`, `display_text`, `html`, `inner_html` or `outer_html` and does what you expect
- `attr` returns the designated attribute (this means you can also use it if you have the misfortune of needing to extract an attribute named `title` or `html`, for instance)
This of course also works with `fields`:

```yaml
iterator: '#fruits li'
fields:
  name:
    extract: text
  color:
    attr: color
```
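As a quick illustration of why the explicit `attr` form matters, here is a sketch (the html is made up for the example) extracting an attribute that happens to be named `title`:

```python
from minet import Scraper

# Hypothetical html where the attribute we want is literally named `title`:
html = '<a href="/article" title="Read more">An article</a>'

# `attr` makes the intent explicit and avoids any ambiguity:
scraper = Scraper({
    'sel': 'a',
    'item': {'attr': 'title'}
})

print(scraper(html))  # Read more
```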
But now that we have a more expressive way to declare what we want to extract, we can also spice things up with item-level subselections.
Remember this html:
```html
<main>
  <article>
    <h2>
      <a href="http://howtoscrape.com">
        How to scrape?
      </a>
    </h2>
    <div>
      Posted by <strong>George</strong>
    </div>
  </article>
  <article>
    <h2>
      <a href="http://howtocrawl.co.uk">
        How to crawl?
      </a>
    </h2>
    <div>
      Posted by <strong>Mary</strong>
    </div>
  </article>
</main>
```
What if we want to extract both the article's url and the user who posted it? We can do so by using `sel` in a field declaration:

```yaml
---
iterator: main > article
fields:
  url:
    sel: a
    attr: href
  user:
    sel: div > strong
```
And this will yield:
```json
[
  {
    "url": "http://howtoscrape.com",
    "user": "George"
  },
  {
    "url": "http://howtocrawl.co.uk",
    "user": "Mary"
  }
]
```
The next thing to understand is that this scraping DSL is completely recursive, which should let you describe any kind of JSON-like structure to output.
Let's check this html again for instance:
```html
<main>
  <ul id="fruits">
    <li color="red">Apple</li>
    <li color="orange">Orange</li>
  </ul>
  <ul id="days">
    <li>Monday</li>
    <li>Saturday</li>
  </ul>
</main>
```
What if we want to return a list of lists of items? Well, say no more:
```yaml
---
iterator: main > ul
item:
  iterator: li
```
Harnessing recursivity will return a nested result:
```json
[
  [
    "Apple",
    "Orange"
  ],
  [
    "Monday",
    "Saturday"
  ]
]
```
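And since the declaration is recursive, nothing prevents the nested iterator from declaring its own `fields`. A sketch, assuming `minet` is installed:

```python
from minet import Scraper

html = """
<main>
  <ul id="fruits">
    <li color="red">Apple</li>
    <li color="orange">Orange</li>
  </ul>
</main>
"""

# A nested iterator whose items are themselves dictionaries of fields:
scraper = Scraper({
    'iterator': 'main > ul',
    'item': {
        'iterator': 'li',
        'fields': {
            'name': 'text',
            'color': 'color'
        }
    }
})

print(scraper(html))
# [[{'name': 'Apple', 'color': 'red'}, {'name': 'Orange', 'color': 'orange'}]]
```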
Usually, a scraper will return `None` if a selection cannot be found or if the desired attribute does not exist. In that case, you might want to return a default value instead. If we consider the following html, for instance:

```html
<ul>
  <li color="blue">John</li>
  <li>Mary</li>
</ul>
```
You could declare a default value thusly:
```yaml
---
iterator: li
item:
  attr: color
  default: black
```
And this would return:
["blue", "black"]
instead of:
["blue", None]
This `default` key can also be thought of as a constant value if you ever need one (this could be useful with `minet` crawlers, for instance, or if you use several scrapers and need to track which one a result came from):

```yaml
iterator: li
fields:
  scraper:
    default: 'My scraper'
  name: text
```
would return:
```json
[
  {"scraper": "My scraper", "name": "John"},
  {"scraper": "My scraper", "name": "Mary"}
]
```
When iterating over a selection, it might be useful to filter out some empty or invalid results. If we consider the following html:

```html
<ul>
  <li color="blue" age="34">John</li>
  <li age="45">Mary</li>
  <li color="purple" age="23">Susan</li>
</ul>
```
We can filter out empty values for the `color` attribute using the `filter` key in our scraper:

```yaml
---
iterator: li
item: color
filter: yes # yes is just one of the multiple ways to say `true` in YAML
```
This should return:
["blue", "purple"]
and not:
["blue", None, "purple"]
The `filter` key can also take a field to filter on, if `fields` is given, like so:

```yaml
---
iterator: li
fields:
  name: text
  color: color
filter: color
```
and this should return:
```json
[
  {"name": "John", "color": "blue"},
  {"name": "Susan", "color": "purple"}
]
```
Note that the path given to `filter` can also be an array of keys, or keys joined by dots, to test nested values. Here is such an example:

```yaml
---
iterator: li
fields:
  name: text
  attributes:
    fields:
      color: color
      age: age
filter: attributes.age
```
and you will get:
```json
[
  {"name": "John", "attributes": {"color": "blue", "age": "34"}},
  {"name": "Susan", "attributes": {"color": "purple", "age": "23"}}
]
```
Finally, if you need more complex filtering, you can still read about `filter_eval` here.
When iterating over a selection to return multiple items, you might get the same item multiple times and want to avoid that.
If we, for instance, consider the following html:
```html
<ul>
  <li color="blue">John</li>
  <li color="red">Mary</li>
  <li color="blue">Susan</li>
</ul>
```
You can use the `uniq` key to make sure that scraping the `color` attributes will only yield `"blue"` once:

```yaml
---
iterator: li
item: color
uniq: yes
```
and you will get:
["blue", "red"]
instead of:
["blue", "red", "blue"]
If you only want to keep a single item based on one of its `fields`, you can also give a field to the `uniq` key, the same way you would with `filter`. In this case, only the first item having a given value for the chosen field will be kept, so that the following scraper:

```yaml
iterator: li
fields:
  name: text
  color: color
uniq: color
```
will return:
```json
[
  {"name": "John", "color": "blue"},
  {"name": "Mary", "color": "red"}
]
```
Sometimes, the DSL will not be enough, as you may need to perform complex things with your data that the declaration language does not implement. In this case, someone with more sense than I would tell you that it is time to start scripting in python directly to cover your use case. But if you still want to benefit from `minet`'s integration, craft and ecosystem, you can still give scrapers python code, as strings, to evaluate.

Warning!: the `minet` commands have been designed as nothing more than tools for personal use. As such, as with anything able to execute arbitrary code, you should never trust potential user inputs as safe if you intend to use them as scrapers. This is a clear security risk! We might add options to toggle off the evaluation capabilities of scrapers in the future, but this does not exist as of today.
Now that this is clear, let's learn how to sprinkle some python code on top of our declarative scrapers!
Let's consider the following html:
```html
<main>
  <div id="colors">
    <p>Red</p>
    <p>Blue</p>
  </div>
  <div id="animals">
    <ul>
      <li>Tiger</li>
      <li>Dog</li>
    </ul>
  </div>
</main>
```
If you want to select the items found under each `div` using a condition (yes, I know we could get away with some CSS wizardry here; this is a toy example, bear with me please), you will need to evaluate an expression like so:

```yaml
---
iterator: div
fields:
  kind: id
  items:
    iterator_eval: "element.select('p') if element.get('id') == 'colors' else element.select('li')"
```
and you will get:
```json
[
  {
    "kind": "colors",
    "items": ["Red", "Blue"]
  },
  {
    "kind": "animals",
    "items": ["Tiger", "Dog"]
  }
]
```
In this example, the local variable `element` is the current html element being iterated on (a `div`, in our case), as represented by BeautifulSoup, a very popular python html parsing library used by most scrapers. To get a list of all the exposed variables you may use, you can check this list.

Note that evaluated strings can either be a single python expression, or longer code written on multiple lines that should `return` a result at some point. Here is an example of the same logic written on multiple lines in a YAML scraper:
```yaml
iterator: div
fields:
  kind: id
  items:
    # Mind the "|" char: it means the following indented lines will preserve line returns
    iterator_eval: |
      if element.get('id') == 'colors':
        return element.select('p')
      return element.select('li')
```
Of course, the syntax of the evaluated strings will be checked beforehand to make sure they are valid python code, and any `minet.Scraper` built using invalid code will raise validation errors before you can even attempt to use it:

```python
from minet import Scraper
from minet.scrape.exceptions import InvalidScraperError

definition = {
    'iterator': 'div',
    'item': {
        'eval': 'element.get_text() + '
    }
}

try:
    scraper = Scraper(definition)
except InvalidScraperError as e:
    # This error will be raised because "item.eval" is not valid python code!
    # You can check the specific validation errors like so:
    for error in e.validation_errors:
        print(repr(error))

# Note that the command line will report errors for you so you can fix them
```
You should also note that an evaluated expression's result will be type-checked at runtime to ensure it returned a valid thing. As such, `iterator_eval` will yell at you if you make it return anything other than a `list` of `bs4.Tag`.
Now, here is another example where we evaluate `sel` instead of `iterator`:

```yaml
iterator: div
fields:
  kind: id
  first_item:
    sel_eval: |
      if element.get('id') == 'colors':
        return element.select_one('p')
      return element.select_one('li')
```
this will return:
```json
[
  {
    "kind": "colors",
    "first_item": "Red"
  },
  {
    "kind": "animals",
    "first_item": "Tiger"
  }
]
```
Finally, note that both `sel_eval` and `iterator_eval` may also return a single string that will subsequently be used as a CSS selector. As such, this example has the same behavior as the previous one:

```yaml
iterator: div
fields:
  kind: id
  first_item:
    sel_eval: |
      if element.get('id') == 'colors':
        return 'p'
      return 'li'
```
You can also evaluate python code to process and transform the scraped data to make it fit your needs. Let's consider the following html, with case variations so ugly they hurt the eyes:

```html
<div id="colors">
  <p>Red</p>
  <p>Blue</p>
  <p>YELLOW</p>
  <p>orange</p>
  <p>ReD</p>
</div>
```
Being a diligent data wrangler, you might want to lowercase all those strings to normalize them. This is where the `eval` key will be able to help you:

```yaml
---
iterator: p
item:
  eval: value.lower()
uniq: yes
```
and you will get:
["red", "blue", "yellow", "orange"]
In this expression, the `value` variable represents the current value to be returned, which is by default (unless you gave a `default` key) the text of the current html element. Also notice how the `uniq` key applies after your transformation.
Finally, keep in mind that `eval` is the last thing that gets executed. So if you declared an `attr` key, for instance, then the `value` variable will hold this attribute's value when the `eval` runs. To understand precisely in which order operations are applied, be sure to read this part of the tutorial.
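Here is a small sketch of that ordering (assuming `minet` is installed): `attr` is applied first, then `eval` receives its result as `value`:

```python
from minet import Scraper

html = '<li color="blue">John</li>'

# `eval` runs last: `value` holds the result of `attr`, not the node's text.
scraper = Scraper({
    'sel': 'li',
    'item': {
        'attr': 'color',
        'eval': 'value.upper()'
    }
})

print(scraper(html))  # BLUE
```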
The last thing one can "evaluate" is filtering. Let's consider this html:
```html
<ul>
  <li data-status="active">John</li>
  <li data-status="inactive">Mary</li>
  <li data-status="active">Susan</li>
</ul>
```
Now let's imagine you need to filter out `inactive` items. We could do so using the `filter_eval` key (once again, I know perfectly well that a finely tuned CSS selector could do the job just fine, but I need an example, so let this be a cautionary tale that evaluation is seldom necessary):
```yaml
iterator: li
filter_eval: "element.get('data-status') != 'inactive'"
```
and, as expected, this will return:

```json
["John", "Susan"]
```
If you build your scraper declaration directly in python, you can replace all the evaluation keys with functions. So, reusing this html from a previous example:

```html
<div id="colors">
  <p>Red</p>
  <p>Blue</p>
  <p>YELLOW</p>
  <p>orange</p>
</div>
```
You could build your scraper thusly:
```python
from minet import Scraper

# The `**kwargs` part is necessary because many arguments will always be given,
# so just pick the ones you are interested in
def process_value(value, **kwargs):
    return value.upper()

scraper = Scraper({
    'iterator': 'p',
    'item': {
        'eval': process_value
    }
})

print(scraper(html))
# ['RED', 'BLUE', 'YELLOW', 'ORANGE']
```
Keyword arguments given to the functions replacing `eval` directives are listed in the next section (minus the dependencies part, which does not make sense if you work in python directly, since you can import them yourself).
Dependencies

- `json`: python's `json` module.
- `urljoin`: python's `urllib.parse.urljoin`.
- `re`: python's `re` module.
- `soupsieve`: the `soupsieve` module.

Helpers

- `get_display_text`: function extracting the text of the given html element while attempting to render its whitespace the same way a browser would.
- `parse_date`: function parsing dates (even relative ones such as "53 minutes ago") and returning them formatted as an ISO datetime. Can take a `lang` kwarg if required.

Local variables

- `element` (`bs4.Tag`): currently selected html element.
- `elements` (`list<bs4.Tag>`): html elements currently being iterated over.
- `value` (any): current output value.

Context

- `context` (`dict`): local context object (more about this in the next part of this documentation).
- `root` (`bs4.Tag`): root element of the html tree.
- `scope`: a scraper-level global scope object that can be used to track state.
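As a hedged example of those helpers, here is a sketch using `parse_date` inside an `eval` expression (the html is made up and the exact output format may differ):

```python
from minet import Scraper

html = '<span class="date">2023-01-02</span>'

# `parse_date` is exposed to evaluated code, so no import is needed here:
scraper = Scraper({
    'sel': '.date',
    'item': {'eval': 'parse_date(value)'}
})

print(scraper(html))  # an ISO datetime string such as '2023-01-02T00:00:00'
```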
When calling a scraper on some html, it is also possible to give the scraper some context, as a `dict`, so that the scraper can use it.
Here is how someone would pass this context in python:
```python
from minet import Scraper

scraper = Scraper('scraper.yml')
scraper(html, context={'important_id': 123, 'nested': {'value': 'hello'}})
```
So, considering this html:
```html
<ul>
  <li>John</li>
  <li>Susan</li>
</ul>
```
You would then be able to access the given context by using the `get_context` key, like so:

```yaml
---
iterator: li
fields:
  id:
    get_context: important_id
  name: text
```
to return:
```json
[
  {"id": 123, "name": "John"},
  {"id": 123, "name": "Susan"}
]
```
Note that you can give a path instead of a single string to access nested information:
```yaml
iterator: li
fields:
  id:
    get_context: important_id
  value:
    get_context: ['nested', 'value'] # could also be written as "nested.value"
  name: text
```
to obtain:
```json
[
  {"id": 123, "value": "hello", "name": "John"},
  {"id": 123, "value": "hello", "name": "Susan"}
]
```
This can be very useful to give the scraper access to some external data if required. As such, when using the `minet scrape` command, useful information is always given to the scraper through context:

- `row` (`dict`): the CSV row representing a row in a `minet fetch` report.
- `path` (`str`): the path to the html file currently being scraped.
- `basename` (`str`): the html file path's base name.
Here is an example where the scraped file's url is added to the output using context:
```yaml
---
fields:
  url:
    get_context: row.url
  title:
    sel: title
```
Sometimes, for clarity or performance, you might want to override the global context with some local one, obtained through scraping. Let's consider the following html:

```html
<html>
  <head>
    <title>Last week's posts</title>
  </head>
  <body>
    <ul>
      <li>
        <p>
          Post n°<strong>1</strong> by <em>Allan</em>
        </p>
      </li>
      <li>
        <p>
          Post n°<strong>2</strong> by <em>Susan</em>
        </p>
      </li>
    </ul>
  </body>
</html>
```
If you want to return a list of dicts repeating the title of the document along with each post, you might do it like so:
```yaml
---
iterator: li > p
fields:
  title:
    eval: "root.select_one('title').get_text()"
  post:
    sel: strong
  author:
    sel: em
```
and you would obtain:
```json
[
  {"title": "Last week's posts", "post": "1", "author": "Allan"},
  {"title": "Last week's posts", "post": "2", "author": "Susan"}
]
```
but this is neither elegant (because you used evaluation) nor performant (because you will need to select the `title` tag once for each post).

This is where the `set_context` key can be used to alleviate both issues. This key takes a recursive scraper definition and will set keys in a "local" context that will be merged with the global one given to the scraper, at each relevant level. Here is how we could refactor the previous scraper:
```yaml
---
set_context:
  title:
    sel: title
iterator: li > p
fields:
  title:
    get_context: title
  post:
    sel: strong
  author:
    sel: em
```
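A compact sketch of the refactored scraper in action (assuming `minet` is installed):

```python
from minet import Scraper

html = """
<html>
  <head><title>Last week's posts</title></head>
  <body>
    <ul>
      <li><p>Post n°<strong>1</strong> by <em>Allan</em></p></li>
      <li><p>Post n°<strong>2</strong> by <em>Susan</em></p></li>
    </ul>
  </body>
</html>
"""

# `set_context` scrapes the title once, then each item reads it back
# through `get_context`:
scraper = Scraper({
    'set_context': {'title': {'sel': 'title'}},
    'iterator': 'li > p',
    'fields': {
        'title': {'get_context': 'title'},
        'post': {'sel': 'strong'},
        'author': {'sel': 'em'}
    }
})

print(scraper(html))
# [{'title': "Last week's posts", 'post': '1', 'author': 'Allan'},
#  {'title': "Last week's posts", 'post': '2', 'author': 'Susan'}]
```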
Finally, note that `set_context` can be used at any level of the scraper's declaration, so that this html:
<div data-topic="science">
<ul>
<li>
<p>
Post n°<strong>1</strong> by <em>Allan</em>
</p>
</li>
<li>
<p>
Post n°<strong>2</strong> by <em>Susan</em>
</p>
</li>
</ul>
</div>
<div data-topic="arts">
<ul>
<li>
<p>
Post n°<strong>3</strong> by <em>Josephine</em>
</p>
</li>
<li>
<p>
Post n°<strong>4</strong> by <em>Peter</em>
</p>
</li>
</ul>
</div>
could be scraped as:
```json
[
  [
    {"topic": "science", "post": "1", "author": "Allan"},
    {"topic": "science", "post": "2", "author": "Susan"}
  ],
  [
    {"topic": "arts", "post": "3", "author": "Josephine"},
    {"topic": "arts", "post": "4", "author": "Peter"}
  ]
]
```
by using the following declaration:
```yaml
---
iterator: div
item:
  set_context:
    topic: data-topic
  iterator: li > p
  fields:
    topic:
      get_context: topic
    post:
      sel: strong
    author:
      sel: em
```
Know that some of the DSL keys have aliases for convenience:
- `sel` can also be written as `$` (which is reminiscent of web development)
- `iterator` can also be written as `$$`

So don't be surprised if you see them appear in people's code sometimes, and don't be afraid to use them if they fit your mindset better.
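For the record, here is the earlier articles scraper rewritten with those aliases (a sketch equivalent to the `sel`/`iterator` version):

```python
from minet import Scraper

# `$$` is an alias of `iterator` and `$` an alias of `sel`:
scraper = Scraper({
    '$$': 'main > article',
    'fields': {
        'url': {
            '$': 'a',
            'attr': 'href'
        },
        'user': {
            '$': 'div > strong'
        }
    }
})
```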
Here, for reference, is the order in which operations are applied at each level of a scraper's declaration:

- Selection of a single element is made using `sel`, `$` or `sel_eval`.
- Selection of iteratees is made if `iterator` or `iterator_eval` is present, in which case the scraper output will be a list at this level.
- `set_context` runs to create a local context.
- If we find that we are collecting `fields`, we recurse for each key.
- Else we collect our `item` by recursing and finally resolving a leaf directive (i.e. if no sub `iterator`, `iterator_eval`, `item` nor `fields` were given):
  - If nothing is given, we collect the current html element's text, unless a `default` value was provided instead.
  - We apply `attr`, `extract` or `get_context`.
  - We run `eval`.
  - If the result is `None`, we set it to the `default` value if provided.
- If working on a list, we filter values through `filter` or `filter_eval`.
- If working on a list, we only keep unique items through `uniq`.
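To see this ordering in action, here is a sketch, assuming `minet` is installed, combining most of those steps in a single scraper:

```python
from minet import Scraper

html = """
<ul>
  <li color="blue">John</li>
  <li>Mary</li>
  <li color="BLUE">Susan</li>
</ul>
"""

scraper = Scraper({
    'iterator': 'li',
    'item': {
        'attr': 'color',                   # 1. collect the attribute
        'eval': 'value and value.lower()'  # 2. then transform it
    },
    'filter': True,                        # 3. then drop empty values (Mary has no color)
    'uniq': True                           # 4. then deduplicate ("blue" is kept once)
})

print(scraper(html))  # ['blue']
```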
Sometimes it can be useful to move up the tree when selecting elements. Unfortunately, CSS has no mechanism for doing so, and I recommend using evaluation for this purpose, for the time being.
So considering the following html:
<div data-topic="science">
<ul>
<li>
<p>
Post n°<strong>1</strong> by <em>Allan</em>
</p>
</li>
<li>
<p>
Post n°<strong>2</strong> by <em>Susan</em>
</p>
</li>
</ul>
</div>
<div data-topic="arts">
<ul>
<li>
<p>
Post n°<strong>3</strong> by <em>Josephine</em>
</p>
</li>
<li>
<p>
Post n°<strong>4</strong> by <em>Peter</em>
</p>
</li>
</ul>
</div>
you could obtain:
```json
[
  {"topic": "science", "post": "1", "author": "Allan"},
  {"topic": "science", "post": "2", "author": "Susan"},
  {"topic": "arts", "post": "3", "author": "Josephine"},
  {"topic": "arts", "post": "4", "author": "Peter"}
]
```
by using the following scraper:
```yaml
iterator: li
fields:
  topic:
    sel_eval: "element.find_parent('div')"
    attr: data-topic
  post:
    sel: p > strong
  author:
    sel: p > em
```
Let's consider the following html:
```html
<div>
  <span>Hello</span>
  <span>world!</span>
</div>
```
Using `item: text` when scraping this html will yield the following text:

```
Hello
world!
```
Notice how the content is on two different lines, mimicking how the html was written. But this is not how a web browser would display it. Rather, a web browser would display this html thusly (CSS notwithstanding):

```
Hello world!
```
So if you need to extract raw text from complex html constructions, you should probably use `item: display_text` instead, so that the scraper will try to render the html's whitespace the same way a web browser would.

This can be very important if you want to perform typical NLP tasks on the result, for instance.
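Here is a quick sketch contrasting the two (assuming `minet` is installed; the exact raw text output may vary slightly with parsing):

```python
from minet import Scraper

html = """
<div>
  <span>Hello</span>
  <span>world!</span>
</div>
"""

print(Scraper({'sel': 'div', 'item': 'text'})(html))          # "Hello\nworld!" or similar
print(Scraper({'sel': 'div', 'item': 'display_text'})(html))  # "Hello world!"
```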
CSS used to have a draft `:contains()` pseudo-selector that was eventually dropped, but it can still be used in jQuery, for instance.

Fortunately, as `minet` relies on `soupsieve` for CSS selection, you remain free to use the equivalent `:-soup-contains()`, documented here:
```yaml
iterator: li
fields:
  title:
    sel: strong
  url:
    sel: 'a:-soup-contains("Full Story")'
    attr: href
```