04 BlackWidow
BlackWidow is a .NET library based on SharpGrabber. Rather than relying on .NET assemblies, BlackWidow executes scripts written specifically for grabbing.
BlackWidow gives you the following advantages over the traditional NuGet package approach:
- Always Up-to-date: The scripts are kept up-to-date at runtime, so the functionality of the host application won't break as the sources change - at least not for long!
- ECMAScript Support: Supports JavaScript/ECMAScript out of the box.
- Easy Maintenance: JavaScript is darn easy to write and understand! This helps contributors to quickly write new grabbers or fix the existing ones.
- Secure: The scripts are executed in a sandbox environment, and they only have access to what the BlackWidow API exposes to them.
- Highly Customizable: Almost everything is open for extension or replacement. Make new script interpreters, create custom grabber repositories, or roll out your own interpreter APIs.
To understand how BlackWidow works, we first need to introduce Grabber Repositories. A Grabber Repository is a collection of scripts written specifically for grabbing, each with its own identifier, version information, and metadata. Grabber Repositories can store and read scripts from various sources, such as a local disk or a GitHub repository.
The BlackWidow library provides a service that takes two repositories: a local one and a remote one. Since grabbers are always loaded locally, the service tries to download new scripts from the remote repository and save them locally. It also watches the local repository for changes; whenever a script is added or changed, it reads the JavaScript files, loads them into memory, and registers them as implementations of IGrabber.
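The update flow described above can be pictured with a small conceptual sketch. This is not BlackWidow's implementation: plain Map objects stand in for the local and remote repositories, and syncRepositories along with the version and source fields are hypothetical names chosen only for illustration.

```javascript
// Conceptual sketch only: plain Maps stand in for the local and remote
// repositories. None of these names come from the BlackWidow API.
function syncRepositories(local, remote) {
  const updated = [];
  for (const [id, remoteScript] of remote) {
    const localScript = local.get(id);
    // Copy the script when it is missing locally or its version differs.
    if (!localScript || localScript.version !== remoteScript.version) {
      local.set(id, { ...remoteScript });
      updated.push(id);
    }
  }
  // The ids returned here are the scripts a host would need to reload.
  return updated;
}

const local = new Map([
  ['example', { version: '1.0.0', source: '/* old script */' }],
]);
const remote = new Map([
  ['example', { version: '1.1.0', source: '/* new script */' }],
  ['another', { version: '1.0.0', source: '/* added script */' }],
]);

const changed = syncRepositories(local, remote);
console.log(changed);
```

In this sketch, a second call with the same inputs reports nothing to reload, which mirrors the idea that grabbers are always loaded from the local copy once it is current.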
Install the NuGet package.
Install-Package SharpGrabber.BlackWidow -Version 1.1
This GitHub repository provides a collection of grabber scripts, including all of the officially available ones. All contributions to the grabber scripts should be made in this directory and merged through PRs.
var service = await BlackWidowBuilder.New()
    .SetScriptHost(ScriptHost = new())
    .ConfigureInterpreterService(icfg => icfg.AddJint())
    .ConfigureLocalRepository(cfg => cfg.UsePhysical(@"blackwidow/repo"))
    .ConfigureRemoteRepository(cfg => cfg.UseOfficial())
    .BuildAsync();
- SetScriptHost sets the IScriptHost, which is an object with callback methods that are called directly from the script, such as console.log.
- ConfigureInterpreterService configures the BlackWidow service to add JavaScript support using Jint.
- Using ConfigureLocalRepository and UsePhysical, we specify a local path relative to the current directory.
- Using ConfigureRemoteRepository and UseOfficial, we declare that our remote source of truth is the official SharpGrabber GitHub repository.
The BlackWidow service provides an IGrabber instance. This special implementation of IGrabber is dynamic, meaning that it has an always up-to-date internal list of grabbers, each associated with a local script.
// while building the Multi-Grabber
var grabber = GrabberBuilder.New()
    .UseDefaultServices()
    .Add(service)
    .Build();
This section describes how one can write grabber scripts in JavaScript.
A grabber script should expose two methods to its BlackWidow host, named supports and grab, as if implementing IGrabber.
The host defines a variable named grabber in the global scope. The script should set the supports and grab methods on this object.
This method tests if this grabber supports the input URL.
grabber.supports = url => /^https?:\/\/(www\.)?example\.com/.test(url)
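To try the predicate outside BlackWidow, a minimal sketch follows. It assumes an empty object as a stand-in for the grabber variable the host would normally define; the sample URLs are made up for illustration.

```javascript
// Stand-in for the object the host places in the global scope.
const grabber = {};

grabber.supports = url => /^https?:\/\/(www\.)?example\.com/.test(url);

// Sample URLs, invented for this sketch.
const samples = [
  'https://www.example.com/watch/1', // matches
  'http://example.com',              // matches
  'https://video.example.com/1',     // other subdomain, no match
  'ftp://example.com/1',             // other scheme, no match
  'https://example.community/x',     // matches, see the note below
];

for (const url of samples) {
  console.log(url, grabber.supports(url));
}
```

Note that the pattern is anchored only at the start, so a host such as example.community also passes the test. If that matters for a given grabber, the expression could be tightened, for example by requiring a slash or the end of the string after the domain.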
This method processes the request by grabbing information from the provided URL and updates result.
grabber.grab = (request, result) => {
    const url = request.url
    result.title = '<title of the video!>'
    result.grab('info', {
        author: 'Some guy',
        length: 60, // 1m
    })
    ...
}
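A script like this can be exercised outside the host with hand-written stand-ins. The sketch below is only a test harness under stated assumptions: createMockResult, its resources array, and the shape of the request object are hypothetical, and the real objects passed in by BlackWidow may expose more members or behave differently.

```javascript
// Stand-in for the global object defined by the host.
const grabber = {};

grabber.supports = url => /^https?:\/\/(www\.)?example\.com/.test(url);

grabber.grab = (request, result) => {
  result.title = 'Sample title';
  result.grab('info', {
    author: 'Some guy',
    length: 60, // 1m
  });
};

// Hypothetical mock: records whatever the script reports through grab().
function createMockResult() {
  const resources = [];
  return {
    title: null,
    resources,
    grab(type, data) {
      resources.push({ type, data });
    },
  };
}

// Hypothetical request carrying only the url property used by the script.
const request = { url: 'https://example.com/video/1' };
const result = createMockResult();

// Mirror the host's order of calls: check support first, then grab.
if (grabber.supports(request.url)) {
  grabber.grab(request, result);
}

console.log(result.title);
console.log(JSON.stringify(result.resources));
```

Keeping the mock this small makes it easy to see which members of request and result a script actually depends on before running it under the real host.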
Work in progress