Create basic matcher and common file for people to contribute URLs to #4
Comments
Think it'll be a good idea to create a JSON file with the following schema: {"urls":[
So each element consists of a URL key and its {object}. The object properties are:
* source: The source of information showing that a company is actually involved in contributing to government surveillance.
* isDirect: A boolean flag indicating whether this company is the actual signatory. For example, Heroku did not sign the CISA bill but is owned by Salesforce, which did sign it. Therefore Salesforce would be direct, while Heroku is not.
* parent: If there is a parent (the case where isDirect is false), the URL of the parent site.
* description: A short blurb describing how the company is involved in government surveillance programs.
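To make the proposed schema concrete, here is a minimal sketch in Python of what such a file could look like and how contributed entries might be checked. It assumes that "urls" is an object keyed by domain (the schema above is only partly shown, so this is a guess), and the helper name validate_entry, the sample keys and the placeholder source links are all hypothetical.

```python
import json

# Field names come from the proposal above; the expected types are an assumption.
REQUIRED_FIELDS = {"source": str, "isDirect": bool, "description": str}

# Hypothetical sample file; the "source" values are placeholders, not real references.
SAMPLE_JSON = """
{
  "urls": {
    "salesforce.com": {
      "source": "https://example.org/placeholder-source",
      "isDirect": true,
      "description": "Signed the CISA bill."
    },
    "heroku.com": {
      "source": "https://example.org/placeholder-source",
      "isDirect": false,
      "parent": "https://www.salesforce.com",
      "description": "Owned by Salesforce, which signed the CISA bill."
    }
  }
}
"""


def validate_entry(url, entry):
    """Return a list of problems found in one contributed entry."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(entry.get(field), expected_type):
            errors.append(f"{url}: '{field}' must be a {expected_type.__name__}")
    # An indirect entry should point at the parent that is the signatory.
    if entry.get("isDirect") is False and not entry.get("parent"):
        errors.append(f"{url}: 'parent' is required when isDirect is false")
    return errors


data = json.loads(SAMPLE_JSON)
for url, entry in data["urls"].items():
    print(url, validate_entry(url, entry) or "ok")
```

A check like this could run in CI so that contributed entries missing a source or a parent are caught before merging.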
This introduces tests that check that URLs without http at the start return false when checked against the blacklist. This feels unnecessary, since such URLs wouldn't match the blacklist anyway, but I'm keeping it for now. Also wrote the matching function, which takes a URL and a dictionary of URLs to check against. The matcher function is responsible for stripping URLs down to their root domain names and checking whether they can be found in the dictionary. Currently it returns true or false, but later I'll want it to return something more informative, such as an empty string for no match and the dictionary 'key' when it does find a matching URL.
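As a rough illustration of the boolean matcher and its tests, here is a minimal sketch in Python. The function name is_blacklisted and the sample dictionary are hypothetical, and reducing a host to its last two labels is a naive assumption about what "root domain" means (it would be wrong for suffixes such as co.uk).

```python
from urllib.parse import urlsplit


def is_blacklisted(url, blacklist):
    """Return True if the root domain of `url` is a key in `blacklist`."""
    parts = urlsplit(url)
    # URLs that do not start with http:// or https:// never match.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    labels = parts.hostname.split(".")
    # Naive root-domain guess: keep only the last two labels.
    root = ".".join(labels[-2:])
    return root in blacklist


# Hypothetical blacklist with a single entry.
blacklist = {"domain.com": {}}

# Checks mirroring the behaviour described above.
assert is_blacklisted("http://www.domain.com", blacklist)
assert is_blacklisted("https://sub.domain.com/", blacklist)
assert not is_blacklisted("domain.com", blacklist)      # no http at the start
assert not is_blacklisted("www.domain.com", blacklist)  # no http at the start
assert not is_blacklisted("http://other.org", blacklist)
```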
The URL matcher can now take a URL in any of these formats:
* https://domain.com
* http://domain.com
* http://www.domain.com
* http://www.sub.domain.com
* http://sub.domain.com
* http://sub.domain.com/

and a few more, and match it to the key 'domain.com'. It does this in the following steps:
* It removes the protocol part of the URL.
* It takes out the www. part if it exists.
* It removes the trailing slash.
* It checks whether the URL as stripped so far can be found in the dictionary.

The last check is done so that we can match a subdomain only, instead of always matching the full root domain. An example would be a company blog that resides on a blogging service: the blogging service shouldn't be penalised, but <domain>.bloggingservice.com should.
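The stripping steps can be sketched as follows in Python. This is not the project's actual code: the name match_url is hypothetical, and dropping one leading label at a time after the first lookup is an assumption about how a subdomain ends up matching the key 'domain.com'. It returns the matching key, or an empty string for no match, as suggested in the earlier commit message.

```python
def match_url(url, blacklist):
    """Return the matching blacklist key, or "" when nothing matches."""
    host = url
    # Step 1: remove the protocol part.
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    else:
        return ""  # assumption: URLs without http(s) never match
    # Step 2: take out a leading "www." if it exists.
    if host.startswith("www."):
        host = host[len("www."):]
    # Step 3: remove the trailing slash (and, as an assumption, any path).
    host = host.split("/", 1)[0]
    # Step 4: look up the host as stripped so far, so that a blacklisted
    # subdomain can match on its own; then drop one leading label at a
    # time until a key is found or only the top-level label remains.
    labels = host.split(".")
    while len(labels) >= 2:
        candidate = ".".join(labels)
        if candidate in blacklist:
            return candidate
        labels = labels[1:]
    return ""


# Hypothetical entries: a root domain and a single blog on a blogging service.
blacklist = {"domain.com": {}, "company.bloggingservice.com": {}}

print(match_url("http://www.sub.domain.com/", blacklist))
print(match_url("http://company.bloggingservice.com", blacklist))
print(match_url("http://other.bloggingservice.com", blacklist))
```

Because the full host is tried before any shorter form, the blogging service itself is never flagged unless its root domain is added as a key.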