Crawl existing web history? #35

Open
jarmitage opened this issue Sep 12, 2016 · 2 comments

Comments

@jarmitage

This might be a can of worms, but at the least I'd be interested to know your [the author's] thoughts on whether it's feasible to scrape a user's web history and integrate that content into this tool's search capabilities, and how one might approach it.

Thanks for making this plugin!

@dldx

dldx commented Sep 19, 2016

I've come up with a hackish, slightly technical way of doing this. Chrome/Opera stores only the past 3 months' worth of history, which is annoying, but that's what we have to work with. For me that's still a helluva lot of URLs, so I had to come up with various ways of filtering the list down to something more manageable; I don't really want to reload every random website I visited in any case. So here's what I did. These instructions are for Linux, but I'm sure they'd be similar on a Mac:

  1. Change Chrome's settings to not load any images, to save bandwidth and memory. Also close or save any tabs you care about, because we're going to open a lot of new tabs at once and you won't be able to rescue the old ones.
  2. Close all Chrome/Opera windows - the history file is locked while the browser is running.
  3. Install Sqliteman or a similar SQLite database viewer, plus sqlite3-pcre (a regex extension for SQLite).
  4. Open the History database, which lives at ~/.config/google-chrome/Default/History (or something similar if you have several profiles) or ~/.config/opera/History.
  5. Load the regex extension in Sqliteman with SELECT load_extension('/usr/lib/sqlite3/pcre.so');
  6. Run the following query to build the list of websites you want:
    select urls.url from urls
    inner join visits on urls.id = visits.url
    where urls.url not like '%google.%'
      and urls.url not like '%facebook.com%'
      and urls.url not like '%youtube.com%'
      and urls.url not like '%localhost%'
      and urls.url not like '%127.0%'
      and urls.url not like '%192.168%'
      and urls.url not like '%zero%'
      and urls.url not like '%out.reddit.com%'
      and urls.url not regexp '^https?:\/\/[\w\.]+[a-z\/]?$'
      and (urls.title like '%income%' or urls.title like '%climate%')
    group by urls.url
    order by sum(visits.visit_duration) desc;
    This is just an example; change it to suit your needs. I filtered out facebook, youtube, localhost, etc. because they wouldn't be interesting, then used the regexp to drop every URL that just points to a site's homepage, and finally searched for the words "income" or "climate" in the page titles, because I'm interested in basic income and climate change. (The parentheses around the two title conditions matter: without them, SQL's operator precedence would let any 'climate' title match bypass every other filter.) Without those final title filters I would get thousands of URLs; with them, only about 200. Play with the filters in Sqliteman until you get a list of URLs you want to archive, but make sure it isn't too long. Save the SQL you used, including the load_extension line, to a file called interesting_sites.sql, then close Sqliteman.
  7. Open a terminal and run something like this:
    cat interesting_sites.sql | sqlite3 ~/.config/opera-developer/History | while read -r line; do opera-developer --new-page "$line" & done
    Replace opera-developer with google-chrome, etc. to match your browser.
  8. This command pulls the list of URLs from SQLite, then loads each one in Chrome/Opera, and Falcon should automatically index every page as it loads. It worked pretty well for me and only took a few seconds to load about 150 sites. (A single-script version of steps 4-8 is sketched just below this list.)
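
For reference, here's the same flow as one script. This is only a sketch, written in TypeScript for Node with the better-sqlite3 package; it skips the regexp homepage filter (which needs the pcre extension loaded), and the profile path, filter terms, URL count, and browser binary are example values to adjust for your own setup:

    // crawl_history.ts -- a rough sketch, not the plugin's own code.
    import Database from "better-sqlite3";    // npm install better-sqlite3
    import { execFile } from "child_process";
    import * as os from "os";
    import * as path from "path";

    // Example profile path; adjust for Opera or a non-default profile.
    const historyPath = path.join(
      os.homedir(), ".config", "google-chrome", "Default", "History"
    );

    // Open read-only so we can't corrupt the browser's database.
    const db = new Database(historyPath, { readonly: true });

    // Same idea as the query above, minus the regexp filter.
    const rows = db.prepare(`
      select urls.url as url from urls
      inner join visits on urls.id = visits.url
      where urls.url not like '%facebook.com%'
        and urls.url not like '%youtube.com%'
        and (urls.title like '%income%' or urls.title like '%climate%')
      group by urls.url
      order by sum(visits.visit_duration) desc
      limit 200
    `).all() as { url: string }[];
    db.close();

    // All URLs are read before the browser launches, so the query never
    // races against the browser's lock on the History file.
    for (const { url } of rows) {
      execFile("google-chrome", [url]);   // each URL opens in a new tab
    }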

Hope that helps. I'll try to find a way to filter the history better, but this is what I have so far!

Cheers,
Durand

@blackforestboi

Hey @dldx @jarmitage,

We forked the Falcon tool a while back and integrated importing of the existing history and bookmarks.

We did it via the chrome.history and chrome.bookmarks APIs.
You can check it out here: https://github.com/WorldBrain/Research-Engine
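
For anyone curious, here is a minimal sketch of what that extension-side import can look like, assuming the "history" and "bookmarks" permissions in the manifest; indexPage is a hypothetical stand-in for whatever indexing function the fork actually uses:

    // Hypothetical indexer; the fork's real function will differ.
    declare function indexPage(url: string, title?: string): void;

    // Pull the full history: text "" matches everything, and startTime 0
    // reaches back as far as the browser has kept records.
    chrome.history.search(
      { text: "", startTime: 0, maxResults: 100000 },
      (items) => {
        for (const item of items) {
          if (item.url) indexPage(item.url, item.title);
        }
      }
    );

    // Walk the bookmark tree and index every node that has a URL.
    chrome.bookmarks.getTree((nodes) => {
      const walk = (node: chrome.bookmarks.BookmarkTreeNode): void => {
        if (node.url) indexPage(node.url, node.title);
        node.children?.forEach(walk);
      };
      nodes.forEach(walk);
    });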

We are more than happy to collaborate on this in the future!

Best,
Oliver
