Webis Scriptor

Plug-and-play reproducible web analysis.

Scriptor runs your web analyses on rendered web pages in an up-to-date browser. It owes much of its power to the Playwright browser automation library, but integrates pywb's archiving and replay capabilities for provenance and reproducibility. Use cases are as diverse as high-fidelity web archiving, content extraction, and web user simulation.

Installation

Make sure you have both Docker and a recent NodeJS installation. If you do not want to install NodeJS, you can also run the Docker container directly.

# install packages to run './bin/scriptor.js':
npm install --omit=dev

# install into system path to run 'scriptor', may require sudo or similar:
npm install --global
# if scriptor cannot be found, set the node path (adjust to your system):
export NODE_PATH=/usr/local/lib/node_modules/

Quickstart

To run Scriptor, you need permission to execute docker run.

Take a snapshot:

scriptor --input "{\"url\":\"https://github.com/webis-de/scriptor\"}" --output-directory output1
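Escaping the JSON by hand is error-prone. When Scriptor is launched from a Node.js program, one option is to let JSON.stringify produce the value and pass it as its own argument, so that no shell quoting is involved. This is a minimal sketch that assumes scriptor is on the PATH; the helper name snapshotArgs is made up for this example:

```javascript
// Hypothetical helper: builds the argument list for the snapshot command above.
function snapshotArgs(url, outputDirectory) {
  return ['--input', JSON.stringify({ url }), '--output-directory', outputDirectory];
}

const args = snapshotArgs('https://github.com/webis-de/scriptor', 'output1');
console.log(['scriptor', ...args].join(' '));

// Uncomment to actually start the run (requires Scriptor and Docker):
// require('child_process').spawnSync('scriptor', args, { stdio: 'inherit' });
```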

Use an input directory for more configuration options (e.g., configure the browser with all options of Playwright):

scriptor --input docs/example/snapshot-input/ --output-directory output2

Replace the default script with your own (see Developing Own Scripts):

scriptor --script-directory path/to/my/own/script --output-directory output3

Have a look at available features:

scriptor --help

Output Directory Structure

output/
├─ browserContexts/
|  └─ default/     # Shares the name of the browser context
|     ├─ userData/    # Browser files (cache, cookies, ...)
|     ├─ video/       # Recorded videos if --video is set
|     ├─ warcs/       # Recorded web archive collection
|     ├─ archive.har  # Recorded web archive in HAR format
|     ├─ browser.json # Used browser context options
|     └─ trace.zip    # Playwright trace
├─ id.txt         # Hash of the directory to identify it
├─ input-id.txt   # Hash of the input directory (if any)
└─ logs/
   └─ scriptor.log  # Container log

Scripts usually place additional data into the output directory. For example, the default script adds a snapshot.
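For post-processing, it can help to check that an output directory has the layout shown above before reading from it. The following sketch tests for a few entries of the tree for the "default" browser context. The helper name missingEntries is hypothetical, and treating these particular entries as present after every run is an assumption:

```javascript
const fs = require('fs');
const path = require('path');

// Entries taken from the directory tree above (assumed to exist after a run).
const EXPECTED_ENTRIES = [
  'id.txt',
  path.join('logs', 'scriptor.log'),
  path.join('browserContexts', 'default', 'browser.json'),
  path.join('browserContexts', 'default', 'archive.har'),
  path.join('browserContexts', 'default', 'warcs'),
];

// Hypothetical helper: returns the expected entries that do not exist.
function missingEntries(outputDirectory) {
  return EXPECTED_ENTRIES.filter(
    (entry) => !fs.existsSync(path.join(outputDirectory, entry)));
}

// Usage: an empty array means all listed entries were found.
// console.log(missingEntries('output1'));
```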

The warcs directory is created using pywb and thus follows its directory structure. Note that there are efforts to standardize this structure, and they are looking for feedback!

To view the trace.zip, see the Playwright docs or load it directly into the trace viewer progressive web app.

Scriptor uses Bunyan for logging. The Bunyan CLI lets you filter and pretty-print the logs.
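Bunyan writes one JSON record per line, with a numeric level field (trace=10, debug=20, info=30, warn=40, error=50, fatal=60). If the Bunyan CLI is not at hand, the log can be filtered with a few lines of Node.js. This is a sketch that assumes logs/scriptor.log contains raw Bunyan records; filterLog is a name chosen for this example:

```javascript
const WARN = 40; // Bunyan's numeric level for warnings

// Parses newline-delimited Bunyan JSON and keeps records at or above minLevel.
// Lines that are not valid JSON records are skipped.
function filterLog(text, minLevel = WARN) {
  const records = [];
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue;
    try {
      const record = JSON.parse(line);
      if (record.level >= minLevel) records.push(record);
    } catch (error) {
      // not a Bunyan record; ignore this line
    }
  }
  return records;
}

// Usage (adjust the path to your output directory):
// const text = require('fs').readFileSync('output1/logs/scriptor.log', 'utf8');
// for (const r of filterLog(text)) console.log(`${r.time} [${r.level}] ${r.msg}`);
```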

Running on Archives (Replay)

Scriptor can be configured to use resources from web archives instead of the live web. Use --replay to restrict the browser to resources contained in the WARC files of the input or script directory. Use --replay rw to use these resources but fall back to the live web for anything they do not contain. Use --warc-input <warc-file-or-directory> to include the resources in the specified file (or in all files of the specified directory).

Developing Own Scripts

Create a Script.js and extend AbstractScriptorScript:

const { AbstractScriptorScript, files, pages } = require('@webis-de/scriptor');

module.exports = class extends AbstractScriptorScript {

  constructor() { super("MyScript", "0.1.0"); } // log script name and version
  
  async run(browserContexts, scriptDirectory, inputDirectory, outputDirectory) { }

}

The directory that contains your Script.js is called the "script directory". Specify it on the command line with the --script-directory option, and your script's run method is used instead of the default script's. The script and input directories are read-only. Everything the script produces should be written to the output directory.

Controlling the Browser(s)

Each of the browserContexts is a Playwright BrowserContext object, roughly corresponding to a browser session. Your script can use the BrowserContext's newPage method to create a new Page (like a browser tab), which is the object used to open, read, and manipulate web pages. pages.js adds further methods to this end.
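As an illustration of working with a browser context, the following sketch opens a URL in a new page and writes the page title to the output directory. It uses only Playwright's newPage, goto, title, and close methods; the helper saveTitle and the file name title.txt are made up for this example and are not part of Scriptor:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper that a script's run method could call.
async function saveTitle(browserContext, url, outputDirectory) {
  const page = await browserContext.newPage(); // like opening a new tab
  try {
    await page.goto(url);
    const title = await page.title();
    fs.writeFileSync(path.join(outputDirectory, 'title.txt'), title + '\n');
    return title;
  } finally {
    await page.close(); // close the tab even if navigation failed
  }
}
```

Inside run, this could be called as await saveTitle(browserContexts["default"], url, outputDirectory), with url taken from the script's configuration.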

If the script uses a single browser (the usual case), the run method should start with

const browserContext = browserContexts["default"];

which gets a browser context configured using the browserContexts/default/browser.json files in the script directory and the input directory (specified by --input), if they exist. The following configuration precedence applies (lowest to highest): defaults < script directory browser.json < input directory browser.json < scriptor command line options (e.g., --show-browser). In addition to Playwright's options, the browserType option lets you specify which browser to use: "chromium" (default), "firefox", or "webkit".
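The precedence order can be pictured as a merge in which later sources override earlier ones. The sketch below uses a shallow merge via object spread. Whether Scriptor merges nested options shallowly or deeply is not stated here, so treat that as an assumption; the function name and the default values are made up for illustration:

```javascript
// Hypothetical helper: arguments are ordered from lowest to highest precedence.
function resolveBrowserOptions(defaults, scriptOptions, inputOptions, commandLineOptions) {
  return { ...defaults, ...scriptOptions, ...inputOptions, ...commandLineOptions };
}

const resolved = resolveBrowserOptions(
  { browserType: 'chromium', viewport: { width: 1280, height: 720 } }, // defaults (made-up values)
  { browserType: 'firefox' },                  // browser.json in the script directory
  { viewport: { width: 1920, height: 1080 } }, // browser.json in the input directory
  {}                                           // command line options
);
// resolved.browserType comes from the script directory,
// resolved.viewport from the input directory.
console.log(resolved);
```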

Place directories inside browserContexts to receive correspondingly named browser contexts in run's browserContexts parameter. An output directory is created for each browser context.

Configuring the Script

Most scripts have parameters, which should be specified in a config.json in the input directory or through the other forms of --input (e.g., inline JSON). A config.json in the script directory can be used to specify defaults, though these could also be specified in the script's code. The recommended way to read the JSON files is:

const defaultScriptOptions = { ... };
const requiredScriptOptions = [ ... ];
const scriptOptions = files.readOptions(files.getExisting(
    "config.json", [ scriptDirectory, inputDirectory ]),
  defaultScriptOptions, requiredScriptOptions);

Return Value and Chaining

If a script supports continuing from its output with the same or a different script (see Chaining), its run method should return true, as the default script does. By default, Scriptor stores the browser state (for each browser context) in the output directory so that it is loaded automatically when that output directory is used as the input directory for a new run. As a developer, you only need to make sure that you store all input files for your script (updated, if necessary) at the same location in the output directory. Chaining is intended to create "checkpoints" from which to continue after a crash, or to serve as intermediate archives. Note that a script may return true in some cases and false in others.
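Storing the input files at the same location in the output directory can be as simple as rewriting config.json with updated values. The sketch below shows one way to do this with plain Node.js. The helper carryConfigForward and the pagesVisited option in the usage note are hypothetical and not part of Scriptor:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: writes the (updated) config.json to the same relative
// location in the output directory, so the output can serve as the next input.
function carryConfigForward(inputDirectory, outputDirectory, updates) {
  const inputFile = inputDirectory ? path.join(inputDirectory, 'config.json') : null;
  const config = inputFile && fs.existsSync(inputFile)
    ? JSON.parse(fs.readFileSync(inputFile, 'utf8'))
    : {};
  const nextConfig = { ...config, ...updates };
  fs.writeFileSync(
    path.join(outputDirectory, 'config.json'),
    JSON.stringify(nextConfig, null, 2) + '\n');
  return nextConfig;
}
```

A run method could call carryConfigForward(inputDirectory, outputDirectory, { pagesVisited: 2 }) just before it returns true.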

Scriptor API

Scriptor provides several static functions that help with manipulating Playwright pages and with handling the Scriptor directory structure. See the API documentation.

Chaining

Usually, the output directory of a Scriptor run can serve as the input directory for the next run (as indicated by the script's return value; see Developing Own Scripts). To automate such chaining, use --chain [name] to create the series of output directories within --output-directory. A JSON file in the --output-directory (identified by name) is continuously updated to point to the last successful run and is read on start-up, so you can execute the same scriptor command to continue from the last successful run if the chain aborted for some reason.

Manual Browser Interaction

Scriptor allows for manual interaction with the browser, which can be useful for setting cookies or similar. Specifically, the --show-browser option allows scripts to use the page.pause method, which pauses the script until the user hits the resume button in the dialog that pops up. The same dialog also lets you record interactions as JavaScript code. For such simple use cases, the Manual script can be used: it contains (in essence) only the call to pause.
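In a custom script, the same pattern could look roughly like the following sketch, which opens a page and then hands control to the user. It relies only on Playwright's newPage, goto, and pause methods and assumes the script is run with --show-browser; the helper name interactManually is made up for this example:

```javascript
// Hypothetical helper that a script's run method could call.
async function interactManually(browserContext, url) {
  const page = await browserContext.newPage();
  await page.goto(url);
  await page.pause(); // resolves once the user presses "resume" in the dialog
  return page;        // the caller can continue with the page, e.g., to save its state
}
```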

Since Scriptor runs in a container, it cannot directly open the browser window on your machine. Instead, it runs a VNC server inside the container that you can connect to with a VNC client at localhost:5942 to see the browser window. Depending on your operating system, you might already have a VNC client installed. If not, VNC Viewer is available for all major operating systems. The config options of --show-browser let you change the width and height of the virtual display, change the port, allow remote access, and set a password. See --help.

If you want to run Scriptor on one machine and interact with it from another machine, make sure to read how to use x11vnc (Scriptor uses x11vnc as its VNC server), especially the sections on how to encrypt your traffic. By default, however, the Scriptor Docker container is configured to accept only connections from the machine it is started on.

Running without NodeJS

At the cost of reduced convenience (timeout, nicer interface), you can run Scriptor with only a Docker installation:

docker run -it --rm \
  --volume <script-directory>:/script:ro \
  --volume <input-directory>:/input:ro \
  --volume <output-directory>:/output \
  ghcr.io/webis-de/scriptor:latest <parameters>
  • <script/input/output-directory> are the absolute paths to the respective directories
    • The <script-directory> line can be omitted to run the Snapshot script
    • The <input-directory> line can be omitted to not set --input or when the config is set by --input "{...}" in the <parameters>
  • <parameters> are additional options; see docker run -it --rm ghcr.io/webis-de/scriptor:latest --help
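Because the volume paths must be absolute, it can be convenient to assemble the command programmatically. The following sketch builds the argument list for the docker run command above and resolves relative paths with path.resolve. The helper name dockerRunArgs and its option names are made up for this example:

```javascript
const path = require('path');

// Hypothetical helper: assembles the arguments of the docker run command above.
// scriptDirectory and inputDirectory may be omitted, as described in the list.
function dockerRunArgs({ scriptDirectory, inputDirectory, outputDirectory, parameters = [] }) {
  const args = ['run', '-it', '--rm'];
  if (scriptDirectory) {
    args.push('--volume', `${path.resolve(scriptDirectory)}:/script:ro`);
  }
  if (inputDirectory) {
    args.push('--volume', `${path.resolve(inputDirectory)}:/input:ro`);
  }
  args.push('--volume', `${path.resolve(outputDirectory)}:/output`);
  args.push('ghcr.io/webis-de/scriptor:latest', ...parameters);
  return args;
}

const helpArgs = dockerRunArgs({ outputDirectory: 'output1', parameters: ['--help'] });
console.log(['docker', ...helpArgs].join(' '));
```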

Chaining can also be used without NodeJS. However, the Docker container exits after a single run (by design). Use the same command again to continue the chain.