pURLfy

The ultimate URL purifier.

Note

Did you know that the name "pURLfy" is a combination of "purify" and "URL"? It can be pronounced as pjuɑrelfaɪ.

🪄 Features

pURLfy is typically used to purify URLs: removing redundant tracking parameters, skipping redirect pages, and extracting the link that really matters. However, it is not limited to this. It is a rule-based tool for transforming URLs in general, and other use cases include replacing the domain name or redirecting to an alternative of the given URL. It features:

  • ⚡ Fast: Purify URLs quickly and efficiently.
  • 🪶 Lightweight: Zero dependencies; the minified script is under 4 KB.
  • 📃 Rule-based: Perform purification based on rules, making it more flexible.
  • 🔄️ Async: Calling purify won't block your thread.
  • 🔁 Iterative purification: If the URL still contains tracking parameters after a single purification (e.g. URLs returned by redirect rules), it will continue to be purified.
  • 📊 Statistics: You can track statistics of the purification process, including the number of links purified, parameters removed, URLs decoded, URLs redirected, and characters deleted.

🤔 Usage

🚀 Quick Start

Visit our demo page, try out our Tampermonkey script, or simply run node cli.js <url[]> [<options>] to purify a list of URLs (for more information, refer to the comments in the script).

// Import the `Purlfy` class by some means, e.g. from https://cdn.jsdelivr.net/gh/PRO-2684/pURLfy@latest/purlfy.min.js
const purifier = new Purlfy({ // Instantiate a Purlfy object
    fetchEnabled: true,
    lambdaEnabled: true,
});
const rules = await (await fetch("https://cdn.jsdelivr.net/gh/PRO-2684/pURLfy-rules@core-0.3.x/<ruleset>.json")).json(); // Rules
// You may also use the GitHub raw link for the very latest rules: https://raw.githubusercontent.com/PRO-2684/pURLfy-rules/core-0.3.x/<ruleset>.json
const additionalRules = {}; // You can also add your own rules
purifier.importRules(rules, additionalRules); // Import rules
purifier.addEventListener("statisticschange", e => { // Add an event listener for statistics change
    console.log("Statistics increment:", e.detail); // Only available in platforms that support `CustomEvent`
    console.log("Current statistics:", purifier.getStatistics());
});
purifier.purify("https://example.com/?utm_source=123").then(console.log); // Purify a URL

Here's a list of test URLs that you can use to test pURLfy:

  • Bilibili's short link: https://b23.tv/SI6OEcv
  • Ordinary Tieba link: https://tieba.baidu.com/p/7989575070?share=none&fr=none&see_lz=none&share_from=none&sfc=none&client_type=none&client_version=none&st=none&is_video=none&unique=none
  • MC Wiki's external link: https://link.mcmod.cn/target/aHR0cHM6Ly9naXRodWIuY29tL3dheTJtdWNobm9pc2UvQmV0dGVyQWR2YW5jZW1lbnRz
  • Bing's search result: https://www.bing.com/ck/a?!&&p=de70ef254652193fJmltdHM9MTcxMjYyMDgwMCZpZ3VpZD0wMzhlNjdlMy1mN2I2LTZmMDktMGE3YS03M2JlZjZhMzZlOGMmaW5zaWQ9NTA2Nw&ptn=3&ver=2&hsh=3&fclid=038e67e3-f7b6-6f09-0a7a-73bef6a36e8c&psq=anti&u=a1aHR0cHM6Ly9nby5taWNyb3NvZnQuY29tL2Z3bGluay8_bGlua2lkPTg2ODkyMg&ntb=1
  • A URL nested too many times that cannot be opened normally: https://www.minecraftforum.net/linkout?remoteUrl=https%3A%2F%2Fwww.urlshare.cn%2Fumirror_url_check%3Furl%3Dhttps%253A%252F%252Fc.pc.qq.com%252Fmiddlem.html%253Fpfurl%253Dhttps%25253A%25252F%25252Fgithub.com%25252Fjiashuaizhang%25252Frpc-encrypt%25253Futm_source%25253Dtest

📚 API

Constructor

new Purlfy({
    fetchEnabled: Boolean, // Enable the redirect mode (default: false)
    lambdaEnabled: Boolean, // Enable the lambda mode (default: false)
    maxIterations: Number, // Maximum number of iterations (default: 5)
    statistics: { // Initial statistics
        url: Number, // Number of links purified
        param: Number, // Number of parameters removed
        decoded: Number, // Number of URLs decoded (`param` mode)
        redirected: Number, // Number of URLs redirected (`redirect` mode)
        visited: Number, // Number of URLs visited (`visit` mode)
        char: Number, // Number of characters deleted
    },
    log: Function, // Log function (default is using `console.log` for output)
    fetch: async Function, // Function to fetch the given URL, should at least support `method`, `headers` and `redirect` in `options` parameter (default is using `fetch`)
})

Instance Methods

  • importRules(...rulesets: object[]): void: Import a series of rulesets.
  • purify(url: string): Promise<object>: Purify a URL.
    • url: The URL to be purified.
    • Returns a Promise that resolves to an object containing:
      • url: string: The purified URL.
      • rule: string: The matched rule.
  • clearStatistics(): void: Clear statistics.
  • clearRules(): void: Clear all imported rules.
  • getStatistics(): object: Get statistics.
  • addEventListener("statisticschange", callback: function): void: Add an event listener for statistics change.
    • The callback function will receive a CustomEvent or a plain Event object, depending on whether the platform supports CustomEvent.
    • If the platform supports CustomEvent, the detail property of the event object will contain the incremental statistics.
  • removeEventListener("statisticschange", callback: function): void: Remove an event listener for statistics change.
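
The CustomEvent / Event fallback described above can be pictured with a small stand-in built on the standard EventTarget. This is only a sketch of the pattern: StatsSource and its bump method are hypothetical names and are not part of pURLfy.

```javascript
// Hypothetical stand-in for an object that reports statistics changes.
class StatsSource extends EventTarget {
  #stats = { url: 0, param: 0 };

  // Apply an increment, then notify listeners about it.
  bump(increment) {
    for (const [key, value] of Object.entries(increment)) {
      this.#stats[key] = (this.#stats[key] ?? 0) + value;
    }
    const event = typeof CustomEvent === "function"
      ? new CustomEvent("statisticschange", { detail: increment }) // carries the increment
      : new Event("statisticschange"); // no `detail` available on this platform
    this.dispatchEvent(event);
  }

  getStatistics() {
    return { ...this.#stats };
  }
}

const statsSource = new StatsSource();
const seenChanges = [];
statsSource.addEventListener("statisticschange", (e) => {
  // `e.detail` is undefined when only plain `Event` is available.
  seenChanges.push({ increment: e.detail, total: statsSource.getStatistics() });
});
statsSource.bump({ url: 1, param: 2 });
```

Because detail may be missing, a listener that needs totals on every platform can always fall back to getStatistics().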

Instance Properties

You can change these properties after instantiation, and they will take effect for the next call to purify.

  • fetchEnabled: Boolean: Whether the redirect mode is enabled.
  • lambdaEnabled: Boolean: Whether the lambda mode is enabled.
  • maxIterations: Number: Maximum number of iterations.

Static Properties

  • Purlfy.version: string: The version of pURLfy.

📖 Rulesets

Community-contributed rulesets are hosted on GitHub, and you can find them at pURLfy-rules. The format of a ruleset file is as follows:

{
    "<domain>": {
        "<path>": {
            // A single rule
            "description": "<description>",
            "mode": "<mode>",
            // Other parameters
            "author": "<author>"
        },
        // ...
    },
    // ...
}

Formal definition of the format can be found at ruleset.schema.json.

✅ Path Matching

<domain>, <path>: The domain and a part of path, such as example.com/, /^.+\.example\.com$, path/ and page. Here's an explanation of them:

  • The basic behavior is like paths on Unix file systems.
    • If not ending with /, its value will be treated as a rule.
    • If ending with /, there are more paths under it, like "folders" (theoretically, you can nest infinitely)
    • / is not allowed in the middle of <domain> or <path>.
  • Note that if it starts with /, it will be treated as a RegExp pattern.
    • For example, /^.+\.example\.com$ will match all subdomains of example.com, and /^\d+$ will match a part of path that contains only digits.
    • Remember to escape special characters: a literal . must be escaped in the RegExp, and every \ must be doubled in JSON strings.
    • Empty regex will be ignored. (i.e. / or //)
    • Using RegExp is not recommended unless necessary, since it will slow down the matching process.
  • If it's an empty string, it will be treated as a FallBack rule: this rule will be used when no other rules are matched at this level.
  • If multiple rules match, the best match will be used. (Exact match > RegExp match > FallBack rule)
  • If you want a rule to match all paths under a domain, you can omit <path>, but remember to remove the / after the domain.

A simple example with comments showing the URLs that can be matched:

{
    "example.com/": {
        "a": {
            // The rule here will match "example.com/a"
        },
        "path/": {
            "to/": {
                "page": {
                    // The rule here will match "example.com/path/to/page"
                },
                "/^\\d+$": { // Remember to escape `\`
                    // The rule here will match all paths under "example.com/path/to/" that are composed of digits
                },
                "": {
                    // The rule here will match "example.com/path/to", excluding "page" and digits under it
                }
            },
            "": {
                // The rule here will match "example.com/path", excluding "to" under it
            }
        },
        "": {
            // The rule here will match "example.com", excluding "path" under it
        }
    },
    "example.org": {
        // The rule here will match every path under "example.org"
    },
    "": {
        // Fallback: this rule will be used for all paths that are not matched
    }
}
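
The precedence above (exact match, then RegExp match, then fallback) can be sketched as a small recursive lookup. This is a simplified illustration and not pURLfy's actual matcher: findRule and partsOf are hypothetical helpers, and the sketch assumes a rule key matches as soon as its own part matches, whatever path parts follow.

```javascript
// Split a URL into [domain, ...path parts], dropping empty parts.
function partsOf(href) {
  const url = new URL(href);
  return [url.hostname, ...url.pathname.split("/").filter(Boolean)];
}

// Walk the nested ruleset: exact match > RegExp match > fallback ("").
function findRule(node, parts) {
  const [head, ...rest] = parts;
  if (head !== undefined) {
    // Exact rule ("name") or exact folder ("name/").
    if (Object.hasOwn(node, head)) return node[head];
    if (Object.hasOwn(node, head + "/")) {
      const found = findRule(node[head + "/"], rest);
      if (found) return found;
    }
    // Keys starting with "/" are RegExp patterns; "/" and "//" are ignored.
    for (const key of Object.keys(node)) {
      if (!key.startsWith("/") || key === "/" || key === "//") continue;
      const isFolder = key.endsWith("/");
      const pattern = new RegExp(isFolder ? key.slice(1, -1) : key.slice(1));
      if (!pattern.test(head)) continue;
      const found = isFolder ? findRule(node[key], rest) : node[key];
      if (found) return found;
    }
  }
  // Nothing else matched at this level: use the fallback rule, if any.
  return node[""] ?? null;
}

const sampleRules = {
  "example.com/": {
    "path/": {
      "to/": {
        "page": { description: "page" },
        "/^\\d+$": { description: "digits" },
        "": { description: "to-fallback" },
      },
    },
  },
  "example.org": { description: "org" },
  "": { description: "global-fallback" },
};

console.log(findRule(sampleRules, partsOf("https://example.com/path/to/123")).description);
```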

Here's an erroneous example:

{
    "example.com/": {
        "path/": { // Path ending with `/` will be treated as a "directory", thus you should remove the trailing `/`
            // Attempting to match "example.com/path"
        }
    },
    "example.org": { // Path not ending with `/` will be treated as a rule, thus you should add a trailing `/`
        "page": {
            // Attempting to match "example.org/page"
        }
    },
    "example.net/": {
        "path/to/page": { // Can't contain `/` in the middle - you should nest them
            // Attempting to match "example.net/path/to/page"
        },
        "/^\d+$": { // `\d` won't parse correctly in JSON strings, so use `\\d` instead
            // Attempting to match all paths under "example.net/" that are composed of digits
        }
    }
}

📃 A Single Rule

Paths not ending with / will be treated as a single rule, and there are multiple modes for a rule. The common parameters are as follows:

{
    "description": "<Rule Description>",
    "mode": "<Mode>",
    // Mode-specific parameters
    "author": "<Author>"
}

This table shows supported parameters for each mode:

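Param\Mode white black param regex redirect visit lambda
std ❌ ✅ ❌ ❌ ❌ ❌ ❌
params ✅ ✅ ✅ ❌ ❌ ❌ ❌
acts ❌ ❌ ✅ ✅ ❌ ✅ ❌
regex ❌ ❌ ❌ ✅ ❌ ❌ ❌
replace ❌ ❌ ❌ ✅ ❌ ❌ ❌
ua ❌ ❌ ❌ ❌ ✅ ✅ ❌
headers ❌ ❌ ❌ ❌ ✅ ✅ ❌
lambda ❌ ❌ ❌ ❌ ❌ ❌ ✅
continue ❌ ❌ ✅ ✅ ✅ ✅ ✅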

🟢 Whitelist Mode white

Param Type Default
params string[] Required

Under Whitelist mode, only the parameters specified in params will be kept, and others will be removed. This is the most commonly used mode.
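
As a rough illustration of the effect (not pURLfy's implementation), a whitelist can be applied with the standard URL and URLSearchParams APIs. The helper name applyWhitelist is hypothetical.

```javascript
// Hypothetical helper: keep only the query parameters listed in `params`.
function applyWhitelist(href, params) {
  const url = new URL(href);
  const keep = new Set(params);
  // Copy the keys first, since deleting while iterating would skip entries.
  for (const key of [...url.searchParams.keys()]) {
    if (!keep.has(key)) url.searchParams.delete(key);
  }
  return url.href;
}

console.log(applyWhitelist("https://example.com/watch?v=abc&utm_source=x&t=10", ["v", "t"]));
```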

🔴 Blacklist Mode black

Param Type Default
params string[] Required
std Boolean false

Under Blacklist mode, the parameters specified in params will be removed, and others will be kept. std controls whether the URL search string is treated as standard: the URL is processed only if std is true or the search string is actually standard.
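
The removal step can be sketched with the standard URL API as follows. The helper name applyBlacklist is hypothetical, and the std check is left out because this sketch assumes an ordinary key=value search string.

```javascript
// Hypothetical helper: remove the query parameters listed in `params`.
function applyBlacklist(href, params) {
  const url = new URL(href);
  for (const key of params) {
    url.searchParams.delete(key); // no-op when the parameter is absent
  }
  return url.href;
}

console.log(applyBlacklist("https://example.com/?utm_source=123&id=7", ["utm_source", "utm_medium"]));
```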

🟤 Specific Parameter Mode param

Param Type Default
params string[] Required
acts string[] ["url"]
continue Boolean true

Under Specific Parameter mode, pURLfy will:

  1. Attempt to extract the parameters specified in params in order, until the first existing parameter is matched.
  2. Decode the parameter value using the processors specified in the acts array in order (if any acts value is invalid or throws an error, it is considered a failure and the original URL is returned).
  3. Use the final result as the new URL.
  4. If continue is not set to false, purify the new URL again.
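
Steps 1 to 3 can be sketched as below. This is an illustration under assumptions, not pURLfy's code: applyParamRule and paramProcessors are hypothetical names, only the url processor is included, and a result that does not parse as a URL is treated as a failure. Note that URLSearchParams.get already percent-decodes the value, so in this sketch the url processor acts as a second decoding pass. Step 4 (purifying again) is omitted.

```javascript
// Hypothetical processor table; only `url` is included in this sketch.
const paramProcessors = {
  url: (value) => decodeURIComponent(value),
};

function applyParamRule(href, rule) {
  const url = new URL(href);
  // Step 1: first parameter from `params` that exists in the URL.
  const name = rule.params.find((param) => url.searchParams.has(param));
  if (name === undefined) return href;
  let value = url.searchParams.get(name);
  try {
    // Step 2: run the processors in order (default: ["url"]).
    for (const act of rule.acts ?? ["url"]) {
      if (!Object.hasOwn(paramProcessors, act)) return href; // invalid act: failure
      value = paramProcessors[act](value);
    }
    // Step 3: the result becomes the new URL (it must parse as one here).
    return new URL(value).href;
  } catch {
    return href; // a throwing processor or unparsable result keeps the original
  }
}

console.log(applyParamRule(
  "https://example.com/redirect?to=https%3A%2F%2Fexample.org%2Fpage%3Fid%3D1",
  { params: ["target", "to"] },
));
```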

🟣 Regex Mode regex

Param Type Default
acts string[] []
regex string[] Required
replace string[] Required
continue Boolean true

Under Regex mode, pURLfy will, for each regex-replace pair:

  1. Match the RegExp pattern specified in regex against the URL.
  2. Replace all matched parts with the "replacement string" specified in replace.
  3. Decode the result using the processors specified in the acts array in order (if any acts value is invalid or throws an error, it is considered a failure and the original URL is returned).

If you'd like to learn more about the syntax of the "replacement string", please refer to the MDN documentation.
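
A minimal sketch of applying regex-replace pairs in order, including a $1 group reference in the replacement string. The helper applyRegexRule is hypothetical, and the use of the g flag is an assumption based on "replace all matched parts"; acts is left at its default of [].

```javascript
// Hypothetical helper: apply each regex/replace pair in order to the URL string.
function applyRegexRule(href, rule) {
  return rule.regex.reduce(
    (current, pattern, i) => current.replace(new RegExp(pattern, "g"), rule.replace[i]),
    href,
  );
}

const rewritten = applyRegexRule("https://old.example.com/amp/42", {
  // Backslashes are doubled, as they would be in a JSON ruleset.
  regex: ["^https://old\\.example\\.com/", "/amp/(\\d+)$"],
  replace: ["https://new.example.com/", "/$1"],
});
console.log(rewritten);
```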

🟡 Redirect Mode redirect

Caution

For compatibility reasons, the redirect mode is disabled by default. Refer to the API documentation for enabling it.

Param Type Default
ua string undefined
headers object {}
continue Boolean true

Under Redirect mode, pURLfy calls the fetch function passed to the constructor to send a HEAD request to the matched URL, using headers as the request headers, and takes the Location header or the updated response.url as the redirected URL. If continue is not set to false, the new URL will be purified again.

Note: The ua parameter will be deprecated in the future, and you should use headers to set the User-Agent header.
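
The lookup described above might look roughly like this. It is a sketch under assumptions: resolveRedirect is a hypothetical helper, and passing redirect: "manual" is a guess at how the Location header gets exposed; the real behavior depends on the fetch implementation you supply.

```javascript
// Hypothetical helper: resolve one redirect hop with a HEAD request.
// `fetchImpl` plays the role of the `fetch` constructor option.
async function resolveRedirect(href, headers = {}, fetchImpl = fetch) {
  const response = await fetchImpl(href, { method: "HEAD", headers, redirect: "manual" });
  const location = response.headers.get("location");
  // Prefer the Location header; resolve it in case it is relative.
  if (location) return new URL(location, href).href;
  // Otherwise fall back to the (possibly updated) response URL.
  return response.url || href;
}

// Usage (performs a network request, so it is left commented out):
// resolveRedirect("https://b23.tv/SI6OEcv", { "User-Agent": "curl/8.0" }).then(console.log);
```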

🟠 Visit Mode visit

Caution

For compatibility reasons, the visit mode is disabled by default. Refer to the API documentation for enabling it.

Param Type Default
ua string undefined
headers object {}
acts string[] ["regex:<url_pattern>"]
continue Boolean true

Under Visit mode, pURLfy will visit the URL with headers as the request headers. If the URL has not been redirected, it will call the processors specified in acts in order (<url_pattern> is https?:\/\/.(?:www\.)?[-a-zA-Z0-9@%._\+~#=]{2,256}\.[a-z]{2,6}\b(?:[-a-zA-Z0-9@:%_\+.~#?!&\/\/=]*)). The initial input to acts is of type string, i.e. the text returned by visiting the URL. If the URL has been redirected, the redirected URL will be returned. If continue is not set to false, the new URL will be purified again.

Note: The ua parameter will be deprecated in the future, and you should use headers to set the User-Agent header.
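
A sketch of the default behavior, assuming acts is left at ["regex:<url_pattern>"]. The helper visitAndExtract is hypothetical, and returning the original URL when no link is found is an assumption of this sketch.

```javascript
// The default <url_pattern> quoted above, as a RegExp literal.
const URL_PATTERN = /https?:\/\/.(?:www\.)?[-a-zA-Z0-9@%._\+~#=]{2,256}\.[a-z]{2,6}\b(?:[-a-zA-Z0-9@:%_\+.~#?!&\/\/=]*)/;

// Hypothetical helper: visit the page and pick the redirected URL or the first link in the body.
async function visitAndExtract(href, headers = {}, fetchImpl = fetch) {
  const response = await fetchImpl(href, { method: "GET", headers, redirect: "follow" });
  // Compare against the normalized input so "https://a.b" and "https://a.b/" count as equal.
  if (response.url && response.url !== new URL(href).href) return response.url;
  const text = await response.text();
  const match = text.match(URL_PATTERN);
  return match ? match[0] : href;
}

// Usage (performs a network request, so it is left commented out):
// visitAndExtract("https://example.com/interstitial").then(console.log);
```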

🔵 Lambda Mode lambda

Caution

For security reasons, the lambda mode is disabled by default. If you trust the rules provider, refer to the API documentation for enabling it.

Param Type Default
lambda string Required
continue Boolean true

Under Lambda mode, pURLfy will try to execute the lambda function specified in lambda and use the result as the new URL. The string in lambda is used as the body of an async function that receives a single URL object named url and should return a new URL object. For example:

{
    "example.com": {
        "description": "example",
        "mode": "lambda",
        "lambda": "url.searchParams.delete('key'); return url;",
        "continue": false,
        "author": "PRO-2684"
    },
    // ...
}

If URL https://example.com/?key=123 matches this rule, the key parameter will be deleted. After this operation, since continue is set to false, the URL returned by the function will not be purified again. Of course, this is not a good example, because this can be achieved by using Blacklist mode.
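
To make the "function body as a string" idea concrete, here is one way such a body could be compiled and run. This is an assumption about the mechanism rather than pURLfy's code, and runLambda is a hypothetical helper. Compiling strings into code is also why the mode should only be enabled for trusted rules.

```javascript
// Constructor for async functions; there is no global named `AsyncFunction`.
const AsyncFunction = Object.getPrototypeOf(async function () {}).constructor;

// Hypothetical helper: compile the rule's `lambda` string and run it on a URL object.
async function runLambda(href, body) {
  const fn = new AsyncFunction("url", body); // `url` is the single parameter
  const result = await fn(new URL(href));
  return result.href;
}

runLambda("https://example.com/?key=123&keep=1", "url.searchParams.delete('key'); return url;")
  .then((href) => console.log(href));
```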

🖇️ Processors

Some processors support parameters. To pass them, append them to the function name, separated by a colon (:): func:arg. The following processors are currently supported:

  • url: string->string, URL decoding (decodeURIComponent)
  • base64: string->string, Base64 decoding of UTF-8 strings (Adapted from MDN)
  • slice:start:end: string->string, String slicing (s.slice(start, end)), start and end will be converted to integers
  • regex:<regex>: string->string, regex matching, returns the first match of the regex or an empty string if no match is found
  • dom: string->Document, parse the string as an HTML Document object (you'll need to define DOMParser globally if using in Node.js)
  • sel:<selector>: Any->Element/null, select the first element using CSS selector <selector> (The input shall have querySelector method)
  • attr:<attribute>: Element->string, get the value of the attribute <attribute> of the element (getAttribute)
  • text: Element->string, get the text content of the element (textContent)
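
The func:arg convention and the chaining of processors can be sketched for the string-to-string processors as follows. The names processors and runActs are hypothetical, the DOM-related processors are omitted, and this base64 handles only the standard alphabet.

```javascript
// Hypothetical table covering only the string->string processors.
const processors = {
  url: (s) => decodeURIComponent(s),
  base64: (s) => new TextDecoder().decode(Uint8Array.from(atob(s), (c) => c.charCodeAt(0))),
  slice: (s, arg = "") => {
    const [start, end] = arg.split(":");
    return s.slice(parseInt(start, 10), end ? parseInt(end, 10) : undefined);
  },
  // First match of the regex, or an empty string if there is none.
  regex: (s, arg) => (s.match(new RegExp(arg)) ?? [""])[0],
};

// Run the processors in order, feeding each output into the next.
function runActs(input, acts) {
  return acts.reduce((value, act) => {
    // Split on the first colon only, since regex arguments may contain colons.
    const colon = act.indexOf(":");
    const name = colon === -1 ? act : act.slice(0, colon);
    const arg = colon === -1 ? undefined : act.slice(colon + 1);
    if (!Object.hasOwn(processors, name)) throw new Error(`Unknown processor: ${name}`);
    return processors[name](value, arg);
  }, input);
}

// A two-character prefix followed by Base64, decoded with two chained processors.
const prefixed = "a1" + btoa("https://example.org/");
console.log(runActs(prefixed, ["slice:2", "base64"]));
```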

😎 Projects Using pURLfy

Tip

If you are using pURLfy in your project, feel free to submit a PR to add your project here!
