Create basic matcher and common file for people to contribute URLs to #4

kiriappeee · 2015-09-23T06:44:06Z

No description provided.

kiriappeee · 2015-09-23T07:47:15Z

I think it would be a good idea to create a JSON file with the following schema:

{"urls": {
"example.com": {"source": "http://proofofexample.com", "isDirect": false, "parent": "topexample.com", "description": "This company is owned by a company that is a signatory of the CISA bill"}
}}

So each element consists of a domain key mapped to an object. The object properties are:

source : This is the source of information that says that a company is actually involved in contributing to government surveillance

isDirect: A boolean flag to indicate whether this company is itself the signatory. For example, Heroku did not sign the CISA bill but is owned by Salesforce, which did sign it. Therefore Salesforce would be direct, while Heroku is not.

parent: If there is a parent (in the case where isDirect is false), the url to the parent site

description: A short blurb describing how the company is involved in government surveillance programs
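As a rough illustration of how a contributed file in this shape could be read and sanity-checked, here is a minimal Python sketch. It assumes the top-level "urls" value is a JSON object keyed by domain, that source, isDirect and description are always present, and that parent is required when isDirect is false. The function name load_urls and those validation rules are hypothetical and not part of this PR.

```python
import json

# Assumed required fields and their expected types (not confirmed by the PR).
REQUIRED_FIELDS = {"source": str, "isDirect": bool, "description": str}

SAMPLE = """
{"urls": {
  "example.com": {
    "source": "http://proofofexample.com",
    "isDirect": false,
    "parent": "topexample.com",
    "description": "Owned by a company that is a signatory of the CISA bill"
  }
}}
"""


def load_urls(text):
    """Parse the JSON text and return the dictionary of domain entries."""
    urls = json.loads(text)["urls"]
    for domain, entry in urls.items():
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(entry.get(field), expected_type):
                raise ValueError(f"{domain}: bad or missing '{field}'")
        # Assumption: an indirect entry must name its parent site.
        if not entry["isDirect"] and not entry.get("parent"):
            raise ValueError(f"{domain}: indirect entry needs a 'parent'")
    return urls


urls = load_urls(SAMPLE)
print(sorted(urls))
```

Validating on load would let a malformed contribution fail early instead of silently never matching.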

This introduces tests that check that URLs without http at the start return false when checked against the blacklist. This feels unnecessary, since such URLs wouldn't match the blacklist anyway, but I'm keeping it for now. I also wrote the matching function to take a URL and a dictionary of URLs to check against. The matcher function is responsible for stripping URLs down to root domain names to check whether they can be found in the dictionary. It currently returns true or false, but later I'll want it to return something more informative, such as an empty string for no match and the dictionary key when it does find a matching URL.

The URL matcher can now take a URL in any of these formats:

* https://domain.com
* http://domain.com
* http://www.domain.com
* http://www.sub.domain.com
* http://sub.domain.com
* http://sub.domain.com/

and a few more, and match it to the key 'domain.com'. It does this by first removing the protocol part of the URL. It then takes out the www. part if it exists, and removes the trailing slash. It then checks whether the URL so far can be found in the dictionary. This is done so that we can match a subdomain only, instead of always matching the full root domain. An example would be a company blog that resides on a blogging service: the blogging service shouldn't be penalised, but <domain>.bloggingservice.com should.
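The steps described above could be sketched roughly as follows in Python. This is not the code from this PR: the name match_url is hypothetical, it returns the matched key or an empty string (the more informative return value mentioned as a later goal, not the current true/false), and it assumes that anything after the first slash can be ignored and that dropping one leading label at a time is an acceptable way to reach the root domain.

```python
def match_url(url, urls):
    """Return the key in `urls` that matches `url`, or "" if none does."""
    # URLs without a protocol are rejected, mirroring the tests described above.
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            host = url[len(prefix):]
            break
    else:
        return ""

    # Assumption: the path (and any trailing slash) plays no part in matching.
    host = host.split("/", 1)[0]

    # Drop a leading "www." if present.
    if host.startswith("www."):
        host = host[len("www."):]

    # Try the most specific name first so a subdomain entry can match on its
    # own, then strip one leading label at a time. The bare top-level label
    # (for example "com") is never tried.
    labels = host.split(".")
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in urls:
            return candidate
    return ""


blacklist = {"domain.com": {}, "myco.bloggingservice.com": {}}
print(match_url("http://www.sub.domain.com", blacklist))
print(match_url("http://other.bloggingservice.com", blacklist))
```

Checking the full host before shortening it is what keeps a blogging service itself off the list while still flagging a single hosted blog.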
