Create basic matcher and common file for people to contribute URLs to #4

Open
kiriappeee opened this issue Sep 23, 2015 · 1 comment

Comments

@kiriappeee

No description provided.

@kiriappeee

I think it'd be a good idea to create a JSON file with the following schema:

{"urls": {
"example.com": {"source": "http://proofofexample.com", "isDirect": false, "parent": "topexample.com", "description": "This company is owned by a company that is a signatory of the CISA bill"}
}}

So each element consists of a url : {object} pair. The object properties are:

source : This is the source of information that says that a company is actually involved in contributing to government surveillance

isDirect: A boolean flag to indicate whether this company is the actual signatory. For example, Heroku did not sign the CISA bill but is owned by Salesforce, which did sign it. Therefore Salesforce would be direct, while Heroku would not.

parent: If there is a parent (in the case where isDirect is false), the url to the parent site

description: A short blurb describing how the company is involved in government surveillance programs
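
A minimal sketch of how a contributed file could be checked against this schema, assuming "urls" is a JSON object keyed by domain. The `validate` helper, the `REQUIRED` table, and the rule that an indirect entry must name a parent are illustrative assumptions, not part of the project.

```python
import json

# Sample file contents following the proposed schema (values from the example above).
SAMPLE = """
{"urls": {
  "example.com": {
    "source": "http://proofofexample.com",
    "isDirect": false,
    "parent": "topexample.com",
    "description": "This company is owned by a company that is a signatory of the CISA bill"
  }
}}
"""

# Hypothetical table of fields every entry is assumed to need, with their types.
REQUIRED = {"source": str, "isDirect": bool, "description": str}


def validate(entries):
    """Return a list of problems found in a url -> object mapping."""
    problems = []
    for url, entry in entries.items():
        for field, kind in REQUIRED.items():
            if not isinstance(entry.get(field), kind):
                problems.append(f"{url}: missing or invalid '{field}'")
        # Assumption: an entry that is not a direct signatory should name its parent.
        if entry.get("isDirect") is False and not entry.get("parent"):
            problems.append(f"{url}: indirect entry needs a 'parent'")
    return problems


entries = json.loads(SAMPLE)["urls"]
print(validate(entries))  # → []
```

Running a check like this on every contribution would catch a missing source or parent before the matcher ever sees the file.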

kiriappeee added a commit that referenced this issue Sep 23, 2015
This introduces tests to check that URLs without http at the start
return false when checked against the blacklist. This feels unnecessary,
since such URLs wouldn't match the blacklist anyway, but I'm keeping it
for now. Also wrote the matching function, which takes a URL and a
dictionary of URLs to check against. The matcher function is responsible
for stripping URLs down to root domain names to check whether they can
be found inside the dictionary. Currently it returns true or false, but
later I'll want it to return something more informative, such as an
empty string for no match and the dictionary 'key' when it does find a
matching URL.
kiriappeee added a commit that referenced this issue Sep 23, 2015
The URL matcher can now take a URL in any of these formats

* https://domain.com
* http://domain.com
* http://www.domain.com
* http://www.sub.domain.com
* http://sub.domain.com
* http://sub.domain.com/

and a few more and match it to the key

'domain.com'

It does this by first removing the protocol part of the URL.

It then takes out the www. part if it exists.

It proceeds to remove the ending slash.

It then checks whether it can find the URL so far inside the dictionary.
This is done so that we can match a subdomain on its own instead of
always matching the full root domain. An example would be a company blog
that resides on a blogging service. The blogging service shouldn't be
penalised, but <domain>.bloggingservice.com should.
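
The steps above can be sketched roughly as follows. This is not the committed code: the names `normalise` and `match_url` are hypothetical, cutting the host at the first slash (rather than only dropping a trailing one) is an assumption, and so is the fallback that drops leading labels until a key matches. It returns the matching key or an empty string, which is the return shape the earlier commit message says it wants eventually.

```python
def normalise(url):
    """Reduce a URL to a bare host name; return None if it has no http(s) protocol."""
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url[len(scheme):]  # remove the protocol part
            break
    else:
        return None  # loop finished without a match: no protocol present
    if url.startswith("www."):
        url = url[len("www."):]  # take out the www. part
    # Assumption: keep only the host, which also removes an ending slash.
    return url.split("/", 1)[0]


def match_url(url, blacklist):
    """Return the blacklist key that matches url, or "" when nothing matches."""
    host = normalise(url)
    if host is None:
        return ""
    labels = host.split(".")
    # Try the most specific name first so a listed subdomain wins, then
    # drop leading labels. Stop before the bare top-level label.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in blacklist:
            return candidate
    return ""


blacklist = {"domain.com": {}, "blog.bloggingservice.com": {}}
print(match_url("http://www.sub.domain.com/", blacklist))        # → domain.com
print(match_url("https://blog.bloggingservice.com", blacklist))  # → blog.bloggingservice.com
print(repr(match_url("https://bloggingservice.com", blacklist))) # → ''
```

Checking the full host before the root is what keeps the blogging service itself off the list while still flagging the one blog hosted on it.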