URLFinder

Mistralys edited this page Jun 21, 2021 · 14 revisions

The ConvertHelper_URLFinder class can detect URLs and Email addresses in a string.

Usage example

use AppUtils\ConvertHelper;

$subject = 'Lorem ipsum dolor https://mistralys.com tempor incididunt https://github.com ullamco laboris.';

$urls = ConvertHelper::createURLFinder($subject)->getURLs();

This will return an indexed array with the following URLs, in the order they are found:

https://mistralys.com
https://github.com

NOTE: Duplicate URLs are removed from the result (see "Detecting duplicate URLs").

Finding Email addresses

Use getEmails() to fetch all Email addresses instead of the regular URLs.

use AppUtils\ConvertHelper;

$subject = 'Lorem ipsum dolor: info@mistralys.com tempor incididunt, webmaster@mistralys.com.';

$emails = ConvertHelper::createURLFinder($subject)->getEmails();

This will return the following email addresses, in the order they were found:

mailto:info@mistralys.com
mailto:webmaster@mistralys.com

Use enableSorting() to sort the results alphabetically.

Omitting the mailto scheme

As the previous example shows, all Email addresses are returned with the mailto: prefix. Call omitMailto() to return them without it.

use AppUtils\ConvertHelper;

$subject = 'Lorem ipsum dolor: info@mistralys.com tempor incididunt, webmaster@mistralys.com.';

$emails = ConvertHelper::createURLFinder($subject)
  ->omitMailto()
  ->getEmails();

This will return the following email addresses:

info@mistralys.com
webmaster@mistralys.com

Finding both URLs and Email addresses

By default, URLs and Email addresses are fetched separately, with getURLs() and getEmails() respectively. After calling includeEmails(), getURLs() returns both regular URLs and Email addresses in a single list. Sorting then applies to the combined list as well.
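As a minimal sketch of the combined lookup, the following assumes that includeEmails() can be chained like the other options on this page, and that the package is loaded through Composer's autoloader (the path is an assumption; adjust it to your project).

```php
<?php
// Assumption: the package was installed with Composer.
require 'vendor/autoload.php';

use AppUtils\ConvertHelper;

$subject = 'Lorem ipsum dolor https://mistralys.com tempor incididunt, info@mistralys.com.';

// One call collects the URL and the email address in the same list.
$urls = ConvertHelper::createURLFinder($subject)
  ->includeEmails()
  ->enableSorting()
  ->getURLs();

print_r($urls);
```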

Sorting results

The resulting list can be sorted alphabetically by calling enableSorting():

use AppUtils\ConvertHelper;

$subject = 'Lorem ipsum dolor https://mistralys.com tempor incididunt https://github.com ullamco laboris.';

$urls = ConvertHelper::createURLFinder($subject)
  ->enableSorting()
  ->getURLs();

This will return the following URLs, sorted alphabetically:

https://github.com
https://mistralys.com

Detecting duplicate URLs

By default, the finder recognizes duplicates only if the URLs are exactly the same. For example, it will not detect that https://GitHub.com is the same as https://github.com, because the host is written in a different case.

The normalizing feature will detect such duplicates. It even detects URLs that have the same query parameters, but in a different order. This feature can be enabled using enableNormalizing().

use AppUtils\ConvertHelper;

$subject = '
// Different case
https://GitHub.com 
https://github.com
HTTPS://GITHUB.COM

// Same parameters, different order
https://github.com?paramB=bar&paramA=foo
https://github.com?paramA=foo&paramB=bar
';

$urls = ConvertHelper::createURLFinder($subject)
  ->enableNormalizing()
  ->getURLs();

This will return the following URLs:

https://github.com
https://github.com?paramA=foo&paramB=bar

Normalizing will work on any combination of case and parameters.
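To illustrate what this kind of normalizing involves, here is a minimal standalone sketch in plain PHP. It is not the library's implementation: the functions normalizeUrlKey() and uniqueUrls() are hypothetical names introduced for this example, and the sketch only covers the two cases described above (case differences in scheme and host, and query parameter order).

```php
<?php
// Hypothetical helper, not part of AppUtils: builds a comparison key for a URL.
function normalizeUrlKey(string $url): string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // leave anything that cannot be parsed untouched
    }

    // Scheme and host are case-insensitive, so compare them in lowercase.
    $key = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);
    if (isset($parts['port'])) {
        $key .= ':' . $parts['port'];
    }
    $key .= $parts['path'] ?? '';

    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        ksort($params); // same parameters in a different order give the same key
        $key .= '?' . http_build_query($params);
    }
    if (isset($parts['fragment'])) {
        $key .= '#' . $parts['fragment'];
    }

    return $key;
}

// Hypothetical helper: keeps the first URL seen for each normalized key.
function uniqueUrls(array $urls): array
{
    $seen = [];
    foreach ($urls as $url) {
        $key = normalizeUrlKey($url);
        if (!isset($seen[$key])) {
            $seen[$key] = $url;
        }
    }
    return array_values($seen);
}
```

Passing the five URLs from the example above to uniqueUrls() leaves two entries, one per distinct key. Unlike the finder, this sketch returns the first spelling it encountered rather than a normalized URL.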

Finding relative paths in HTML documents

When finding URLs in HTML documents, it is also possible to extract all relative URLs from known tag attributes like href and src. Enable this feature with enableHTMLAttributes():

use AppUtils\ConvertHelper;

$html = 
'<html>'.
  '<head>'.
    '<script src="libraries/js/site.js"></script>'.
    '<link href="libraries/css/layout.css">'.
  '</head>'.
  '<body>'.
    '<a href="https://github.com">GitHub</a>'.
  '</body>'.
'</html>';

$urls = ConvertHelper::createURLFinder($html)
  ->enableHTMLAttributes()
  ->getURLs();

This will extract the following URLs from the document:

libraries/js/site.js
libraries/css/layout.css
https://github.com

This only works with HTML or XML documents.

Automatically parse URLs

Use the method getInfos() to retrieve URLInfo instances instead of strings. This allows direct access to information on the URLs.
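A short sketch of how this might look in practice. Only getInfos() is taken from this page; the getHost() call on the returned objects is an assumption about the URLInfo class and should be checked against its documentation, as should the Composer autoloader path.

```php
<?php
// Assumption: the package was installed with Composer.
require 'vendor/autoload.php';

use AppUtils\ConvertHelper;

$subject = 'Lorem ipsum dolor https://mistralys.com tempor incididunt https://github.com ullamco laboris.';

$infos = ConvertHelper::createURLFinder($subject)->getInfos();

foreach ($infos as $info) {
    // getHost() is assumed to exist on URLInfo; verify before relying on it.
    echo $info->getHost() . PHP_EOL;
}
```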
