Crawl existing web history? #35

Open
jarmitage opened this issue Sep 12, 2016 · 2 comments

Comments

@jarmitage

This might be a can of worms, but at the least I'd be interested to know your [the author's] thoughts on whether it's feasible to scrape a user's web history and integrate that content into this tool's search capabilities, and how one might approach it.

Thanks for making this plugin!

@dldx

dldx commented Sep 19, 2016

I've come up with a hackish, slightly technical way of doing this. Chrome/Opera stores only the past 3 months' worth of history, which is annoying, but that's what we have to work with. For me that's still a helluva lot of URLs, so I had to come up with various ways of filtering the list down to something more manageable; I don't really want to reload every random website I visited in any case. So here's what I did. These instructions are for Linux, but I'm sure they'd be similar on a Mac:

  1. Change Chrome's settings to not load any images, to save bandwidth and memory. Also close or save any tabs you care about, because we're going to open a lot of new tabs at once and you won't be able to rescue the old ones.
  2. Close all Chrome/Opera windows - the history file is locked while the browser is running.
  3. Install Sqliteman or a similar SQLite database viewer, plus sqlite3-pcre (a regex extension for SQLite).
  4. Open the History database, which lives at ~/.config/google-chrome/Default/History (or something similar if you have several profiles) or ~/.config/opera/History.
  5. Load the regex extension in Sqliteman with SELECT load_extension('/usr/lib/sqlite3/pcre.so');
  6. Run the following query to build the list of websites you want:
    select urls.url from urls
    inner join visits on urls.id = visits.url
    where urls.url not like '%google.%'
      and urls.url not like '%facebook.com%'
      and urls.url not like '%youtube.com%'
      and urls.url not like '%localhost%'
      and urls.url not like '%127.0%'
      and urls.url not like '%192.168%'
      and urls.url not like '%zero%'
      and urls.url not like '%out.reddit.com%'
      and urls.url not regexp '^https?:\/\/[\w\.]+[a-z\/]?$'
      and (urls.title like '%income%' or urls.title like '%climate%')
    group by urls.url
    order by sum(visits.visit_duration) desc;
    This is just an example; change it to suit your needs. I filtered out facebook, youtube, localhost, etc. because they wouldn't be interesting, then used the regexp to drop every URL that just points to a site's homepage, and finally searched for the words "income" or "climate" in the page titles, because I'm interested in basic income and climate change. (The parentheses around the two title conditions matter: without them, SQL's operator precedence would let any 'climate' title match bypass every other filter.) Without those final title filters I would get thousands of URLs; with them, only about 200. Play with the filters in Sqliteman until you get a list of URLs you want to archive, but make sure it isn't too long. Save the SQL you used, including the load_extension line, to a file called interesting_sites.sql, then close Sqliteman.
  7. Open a terminal and run something like this:
    cat interesting_sites.sql | sqlite3 ~/.config/opera-developer/History | while read -r line; do opera-developer --new-page "$line" & done
    Replace opera-developer with google-chrome, etc. to match your browser.
  8. This command pulls the list of URLs from SQLite, then loads each one in Chrome/Opera, and Falcon should automatically index every page as it loads. It worked pretty well for me and only took a few seconds to load about 150 sites. (A single-script version of steps 4-8 is sketched just below this list.)
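
For reference, here's the same flow as one script. This is only a sketch, written in TypeScript for Node with the better-sqlite3 package; it skips the regexp homepage filter (which needs the pcre extension loaded), and the profile path, filter terms, URL count, and browser binary are example values to adjust for your own setup:

    // crawl_history.ts -- a rough sketch, not the plugin's own code.
    import Database from "better-sqlite3";    // npm install better-sqlite3
    import { execFile } from "child_process";
    import * as os from "os";
    import * as path from "path";

    // Example profile path; adjust for Opera or a non-default profile.
    const historyPath = path.join(
      os.homedir(), ".config", "google-chrome", "Default", "History"
    );

    // Open read-only so we can't corrupt the browser's database.
    const db = new Database(historyPath, { readonly: true });

    // Same idea as the query above, minus the regexp filter.
    const rows = db.prepare(`
      select urls.url as url from urls
      inner join visits on urls.id = visits.url
      where urls.url not like '%facebook.com%'
        and urls.url not like '%youtube.com%'
        and (urls.title like '%income%' or urls.title like '%climate%')
      group by urls.url
      order by sum(visits.visit_duration) desc
      limit 200
    `).all() as { url: string }[];
    db.close();

    // All URLs are read before the browser launches, so the query never
    // races against the browser's lock on the History file.
    for (const { url } of rows) {
      execFile("google-chrome", [url]);   // each URL opens in a new tab
    }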

Hope that helps. I'll try to find a way to filter the history better, but this is what I have so far!

Cheers,
Durand

@blackforestboi

Hey @dldx @jarmitage,

We forked the Falcon tool a while back and integrated importing of the existing history and bookmarks.

We did it via the chrome.history and chrome.bookmarks APIs.
You can check it out here: https://github.com/WorldBrain/Research-Engine
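
For anyone curious, here is a minimal sketch of what that extension-side import can look like, assuming the "history" and "bookmarks" permissions in the manifest; indexPage is a hypothetical stand-in for whatever indexing function the fork actually uses:

    // Hypothetical indexer; the fork's real function will differ.
    declare function indexPage(url: string, title?: string): void;

    // Pull the full history: text "" matches everything, and startTime 0
    // reaches back as far as the browser has kept records.
    chrome.history.search(
      { text: "", startTime: 0, maxResults: 100000 },
      (items) => {
        for (const item of items) {
          if (item.url) indexPage(item.url, item.title);
        }
      }
    );

    // Walk the bookmark tree and index every node that has a URL.
    chrome.bookmarks.getTree((nodes) => {
      const walk = (node: chrome.bookmarks.BookmarkTreeNode): void => {
        if (node.url) indexPage(node.url, node.title);
        node.children?.forEach(walk);
      };
      nodes.forEach(walk);
    });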

We are more than happy to collaborate on this in the future!

Best,
Oliver
