TODO (forked from ecoron/GoogleScraper)
- edit caching with decorator pattern
- add all google search params to config
- write functional tests
- add sqlalchemy support for results
- add better proxy handling
- extend parsing functionality
- update readme
- prevent parsing the config twice
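The caching item above could be prototyped with a decorator along these lines. This is a minimal sketch, not GoogleScraper's actual API: the names `cache_results` and `scrape`, and the JSON-based cache key, are all assumptions for illustration.

```python
import functools
import hashlib
import json

def cache_results(func):
    """Hypothetical decorator: cache results keyed by the call arguments."""
    store = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Build a stable key from positional and keyword arguments.
        key = hashlib.sha256(
            json.dumps([args, sorted(kwargs.items())], default=str).encode()
        ).hexdigest()
        if key not in store:
            store[key] = func(*args, **kwargs)
        return store[key]

    return wrapper

@cache_results
def scrape(query, page=1):
    # Stand-in for the real HTTP/selenium request.
    return {"query": query, "page": page}
```

Repeated calls with the same arguments then return the cached object instead of re-scraping.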
04.11.2014:
- Refactor code, change docstrings to Google format
https://google-styleguide.googlecode.com/svn/trunk/pyguide.html#Commentss [done]
15.11.2014:
- add shell access with sqlalchemy session [done]
- test selenium mode thoroughly [done]
- double check selectors
- add alternative selectors
- Add gevent support
- make all modes workable through proxies [done for http and sel]
- update README [done]
- write blog post that illustrates usage of GoogleScraper
- some testing
- release version 0.2.0 on the cheeseshop
- released version 0.1.5 on PyPI [done]
11.12.2014
- JSON output is still slightly corrupt
- CSV output probably also not ideal.
- Improve documentation according to the Google style guide
- Maybe add other search engines!
- finally implement async mode!!!
30.12.2014:
- Fixed issue #45 [done]
02.01.2015:
- Check output verbosity levels and modify them. [done]
13.01.2015:
- Handle sigint. Then close all open files (csv, json).
15.01.2015:
- Implement JSON static tests [done]
- Implement CSV static tests [done]
- Catch Resource warnings in testing [done]
- Add no_results_selectors for all SE [done]
- add test for no_results_selectors parsing [done]
- Add page number selectors for all SE [done]
- add static tests [done]
- add fabfile (google a basic template) []
- add, commit and push to master []
- push to the cheeseshop []
- add function in fabfile that pushes to cheeseshop only after all tests were successful []
- Add functionality that distinguishes the page number of serp pages when caching []
- implement async mode [done]
- read 20 minutes about the built-in asyncio module and decide whether it fits my needs [done]
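The asyncio evaluation above boils down to whether many requests can be fired concurrently with `asyncio.gather`. A minimal sketch; the `fetch` coroutine is a stand-in (a real implementation would need a non-blocking HTTP client), and the keyword/result shapes are illustrative only:

```python
import asyncio

async def fetch(keyword):
    # Stand-in for a real non-blocking HTTP request.
    await asyncio.sleep(0.01)
    return keyword, f"<serp for {keyword}>"

async def scrape_all(keywords):
    # Launch all requests concurrently and collect the results.
    return dict(await asyncio.gather(*(fetch(k) for k in keywords)))

results = asyncio.run(scrape_all(["apples", "oranges"]))
```

Unlike the threaded selenium mode, all keywords here share a single thread and overlap their network waits.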
18.01.2015
- add four different examples:
- a basic usage [done]
- using selenium mode [done]
- using http mode [done]
- using async mode [done]
- scraping with a keywords.py module
- scraping images [done]
- finding plagiarized content [done]
- Add dynamic tests for selenium mode:
- Add event: No results for this query.
- Test Impossible query:
-> Cannot have next_results page
-> No results [done]
-> But still save serp page. [done]
-> add to missed keywords []
- What is the best way to detect that the page has loaded?
-> Research, read about selenium
- Add test for duckduckgo
- Fix: If there's no internet connection, "Malicious request detected" is shown. Show "no internet connection" instead.
- FIGURE OUT: WHY THE HELLO DOES DUCKDUCKGO NOT WORK IN PHANTOMJS?
05.10.2015
- Switch configuration from INI format to plain python code [Done]
- recode parse logic for configuration [Done]
Command Line Settings > Command Line Configuration File > Builtin Configuration File
- rebuild logging system. Create a dedicated logger for each submodule. [Done]
Set the loglevel for each logger to the value specified in the configuration [Done]
=> Logging only reports events. Results are printed according to a dedicated option in the config file.
- write tests for all search engines and for all major modes in the source directory.
Enable Flag which runs the tests automatically. Differ between long tests and short ones.
- Look at some big open source python projects where tests are stored (pelican, requests)
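The precedence chain above (command line settings > command line configuration file > builtin configuration file) maps naturally onto `collections.ChainMap`, where the leftmost mapping wins. A sketch with made-up option names, not GoogleScraper's real config keys:

```python
from collections import ChainMap

# Hypothetical option names for illustration.
builtin_config = {"search_engine": "google", "num_pages": 1, "mode": "http"}
config_file = {"num_pages": 3}      # values loaded from the user's config module
cli_args = {"mode": "selenium"}     # parsed command line flags

# Lookup order: CLI flags > user config file > builtin defaults.
config = ChainMap(cli_args, config_file, builtin_config)
```

Each layer only needs to contain the options it actually overrides; lookups fall through to the next layer automatically.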
30.11.2015
- Find good resources to learn how to test code correctly
[DONE: 12min], found the following links:
- http://docs.python-guide.org/en/latest/writing/tests/
==> LEARNED:
- put test suites that require some complex data structures to load (such as websites to scrape) in separate test suites
- run all (fast) tests before committing code
- run all (including slow ones) before pushing code to master
- use tox for testing the code with multiple interpreter configurations
- mock allows you to monkey-patch functionality in the code so that it returns whatever you want
- http://codeutopia.net/blog/2015/04/11/what-are-unit-testing-integration-testing-and-functional-testing/
==> LEARNED:
- unit tests don't make use of external resources such as databases or the network
- code that is hard to unit test is often poorly designed
- integration test: tests how parts of the system work together
- functional tests: test the complete functionality of the system
- only a small number of functional tests is required: they make sure the app works as a whole.
- "testing common user interactions"
- functional tests are validated in the same way as a user who uses the tool.
- unit/integration tests are validated with code
- don't make them too fine grained!
- https://code.google.com/p/robotframework/wiki/HowToWriteGoodTestCases
==> LEARNED:
- never sleep in the code: fixed safety margins take too long (use polling instead)
- http://blog.agilistic.nl/how-writing-unit-tests-force-you-to-write-good-code-and-6-bad-arguments-why-you-shouldnt/
==> LEARNED:
- Classes should be loosely coupled
- avoid cascade of changes when changing one class
- maximize encapsulation in classes
- classes should have one responsibility
- avoid large and tightly coupled classes
- unit test should test the function/class without any dependencies
- unit test tests one thing
- avoid like the plague: tightly coupled functions/classes, difficult-to-understand classes/functions,
functions that do many things, unintuitive classes/functions (bad interface)
- http://www.toptal.com/python/an-introduction-to-mocking-in-python
==> LEARNED:
- instead of testing a function's effects, we can mock the underlying operating system API,
asserting that an os function was called with certain parameters. This lets us verify
that the code invoked the OS correctly without touching the filesystem.
- http://pytest.org/
==> LEARNED:
- How pytest can be invoked: http://pytest.org/latest/usage.html
- pytest can yield more information in the traceback with the -l option
- pytest can be called within python: http://pytest.org/latest/usage.html
- what the directory structure for tests should look like: http://pytest.org/latest/goodpractises.html
- Read and understand the test links collected in the previous task.
[Done: 75min + 25min]
- Add hook to run unit tests before committing code
[Done: 9 min]: Found pre-commit hook that checks pep8 stuff and that runs unit tests
here: https://gist.githubusercontent.com/snim2/6444684/raw/c7f1ec75c3cc0306bd8f36faee7dd201902528e8/pre-commit.py
--- 12 + 100 + 9 + 5 = 126min ---
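The mocking point from the toptal link above (verify that an os function was called with certain parameters instead of testing the filesystem effect) is typically done with `unittest.mock.patch`. The function under test here, `remove_cache_file`, is hypothetical, not real GoogleScraper code:

```python
from unittest import mock
import os

def remove_cache_file(path):
    # Hypothetical function under test: delete a cache file if present.
    if os.path.isfile(path):
        os.remove(path)

with mock.patch("os.path.isfile", return_value=True), \
     mock.patch("os.remove") as removed:
    remove_cache_file("/tmp/serp.cache")
    # No file is actually touched; instead we assert the os API
    # was called with the expected path.
    removed.assert_called_once_with("/tmp/serp.cache")
```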
1.12.2015
- Read that again: http://pytest.org/latest/example/parametrize.html
[Done: 9min], not learned anything really. Is about meta programming in test suites I guess.
- Create virtualenv in Project directory.
[Done: 5min]
- Add hook that runs all tests before pushing to master
[Done: 11min], Hook is a pre-commit hook and will execute all tests found in the directory tests/
- See whether existing test suites do work and fix all issues there.
[Started: 122min], integration tests do work. Functional tests fail because there is an issue in GoogleScraper. Update: Both integration and functional tests do work.
--- 9 + 5 + 11 + 122 = 147min ---
2.12
- Find out why the test test_google_with_phantomjs_and_json_output fails. Why is it not possible to scrape 3 pages with Google in selenium mode?
[Done: 42min]: Because the next page element cannot be located in phantomjs mode for some reason.
- Why can't phantomjs locate the next page?
[Done: 46min]:
- Check version of phantomjs: 1.9.0 is my version
- Newest version of phantomjs: 2.0, but it is too hard to install/compile
- Reason that search is interrupted: Exception is thrown in line
- Read about worker and job patterns (producer-consumer patterns) in python. Learn about queue patterns.
Read the following resources:
- http://www.bogotobogo.com/python/Multithread/python_multithreading_Synchronization_Producer_Consumer_using_Queue.php
- https://pymotw.com/2/Queue/
- http://www.informit.com/articles/article.aspx?p=1850445&seqNum=8
- http://codefudge.com/2015/09/scraping-alchemist-celery-selenium-phantomjs-and-tor.html
- read about casperJS and evaluate whether it might be interesting for GoogleScraper
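The producer-consumer pattern from those resources can be sketched with the stdlib `queue` and `threading` modules. The worker body is a stand-in for the real scrape; the sentinel-based shutdown (one `None` per worker) is a common convention, not GoogleScraper's actual implementation:

```python
import queue
import threading

keywords = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    # Each worker pulls keywords until it receives the shutdown sentinel.
    while True:
        kw = keywords.get()
        if kw is None:                 # sentinel: no more work
            keywords.task_done()
            break
        serp = f"<serp for {kw}>"      # stand-in for the real scrape
        with lock:
            results.append((kw, serp))
        keywords.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for kw in ["apples", "oranges", "pears"]:
    keywords.put(kw)
for _ in threads:
    keywords.put(None)                 # one sentinel per worker
for t in threads:
    t.join()
```

The queue handles all locking for hand-off; the explicit lock only protects the shared results list.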
3.12
- Make functional tests work again
[Done: 120min]
- Fix bug in `GoogleScraper -q 'apples' -s google -m selenium --sel-browser phantomjs -p 10`
7.12
- test that serp rank is cumulative across pages
[Done: 10min] Rank testing doesn't make any sense. Reasons:
- ranks start again in different types of serp results (ads vs normal)
- results aren't ordered by rank in the json or csv output
- ranks don't need to be cumulative, since the absolute rank can be
recomputed from the page number and the per-page rank.
- fix functional test issues of `test_all_search_engines_in_http_mode`
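The rank recomputation mentioned above is a simple offset calculation. This assumes a fixed number of results per page, which is an assumption (real SERPs mix ads and organic results, as noted):

```python
def absolute_rank(rank_on_page, page_number, results_per_page=10):
    # With 10 results per page, rank 1 on page 2 is absolute rank 11.
    return (page_number - 1) * results_per_page + rank_on_page
```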