Web Crawler for CSE Information and Data Retrieval

egabrielsen/webcrawler

Erik's Web Crawler and Search Engine

A Ruby-based web crawler and search engine that pulls information from lyle.smu.edu/~fmoore/ and creates an inverted index, a word frequency list, a list of pages, and a list of all links. The search engine loads the text of the recorded pages from tokens.txt into tf-idf-similarity objects, which can form cosine similarity matrices with other objects. These matrices are used to remove duplicate pages and to return the most relevant results to the user.
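The duplicate-removal idea can be sketched in plain Ruby. This is only an illustration of tf-idf weighting and cosine similarity, not the code in this repository and not the tf-idf-similarity gem's API. The helper names (`tokenize`, `tfidf_vectors`, `cosine`, `drop_near_duplicates`), the smoothed idf formula, and the 0.95 threshold are all assumptions made for the example.

```ruby
# Hypothetical helpers; none of these names come from the repository
# or from the tf-idf-similarity gem.

# Split text into lowercase alphanumeric tokens.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Build one tf-idf vector (term => weight) per document.
# Uses a smoothed idf, log(1 + N / df), which is one common variant.
def tfidf_vectors(docs)
  token_lists = docs.map { |doc| tokenize(doc) }

  # Document frequency: in how many documents each term appears.
  doc_freq = Hash.new(0)
  token_lists.each { |tokens| tokens.uniq.each { |t| doc_freq[t] += 1 } }

  n = docs.size.to_f
  token_lists.map do |tokens|
    counts = Hash.new(0)
    tokens.each { |t| counts[t] += 1 }
    weights = {}
    counts.each { |t, c| weights[t] = c * Math.log(1.0 + n / doc_freq[t]) }
    weights
  end
end

# Cosine similarity between two sparse vectors stored as hashes.
def cosine(a, b)
  dot = a.inject(0.0) { |sum, (term, w)| sum + w * b.fetch(term, 0.0) }
  norm_a = Math.sqrt(a.values.inject(0.0) { |s, w| s + w * w })
  norm_b = Math.sqrt(b.values.inject(0.0) { |s, w| s + w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Keep a document only if it is not too similar to one already kept.
# The threshold is an assumed value for illustration.
def drop_near_duplicates(docs, threshold = 0.95)
  vectors = tfidf_vectors(docs)
  kept = []
  docs.each_index do |i|
    duplicate = kept.any? { |j| cosine(vectors[i], vectors[j]) >= threshold }
    kept << i unless duplicate
  end
  kept.map { |i| docs[i] }
end
```

Two pages with identical text score a similarity of about 1.0 and collapse into one entry, while pages that share no terms score 0.0 and are both kept.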

## Software

Software used:

  • Ruby version 1.9.3
  • Mechanize Gem
  • Nokogiri Gem
  • tf-idf-similarity Gem

## Installation

This web crawler uses Ruby 1.9.3. I use RVM to control which version of Ruby a program runs with. There are other ways to install Ruby, but the steps below assume RVM is installed.

To install the correct Ruby version, run:
rvm install ruby-1.9.3
Next, install the Mechanize gem, which fetches pages and follows links, Nokogiri, the HTML parser the crawler relies on, and tf-idf-similarity, which the search engine uses:
gem install mechanize
gem install tf-idf-similarity
gem install nokogiri

## Use

To use the web crawler, run from the terminal: ruby crawler.rb
Stop words are read from a file called stop_words.txt. Running the web crawler produces several .txt files:

  • pages.txt -> the pages visited by the web crawler
  • links.txt -> all links discovered by the crawler
  • broken_links.txt -> all broken links found on the site
  • tokens.txt -> a list of URLs with their respective document text; the search engine loads this file to create the tf-idf matrix
  • freq.txt -> list of terms and their frequency
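As a rough illustration of how an inverted index and a term frequency list like the one in freq.txt could be built, here is a minimal sketch. It is not taken from crawler.rb: the method name `build_index`, the tokenizing rule, and the in-memory `pages` hash are assumptions, and the real file formats may differ.

```ruby
require 'set'

# Hypothetical sketch: build an inverted index (term => list of URLs)
# and a frequency table (term => total count), skipping stop words.
#
# pages      - Hash of url => page text (assumed shape)
# stop_words - Set of lowercase words to ignore
def build_index(pages, stop_words)
  inverted = Hash.new { |hash, term| hash[term] = [] }
  freq = Hash.new(0)

  pages.each do |url, text|
    text.downcase.scan(/[a-z0-9]+/).each do |term|
      next if stop_words.include?(term)
      freq[term] += 1
      # Record each URL at most once per term.
      inverted[term] << url unless inverted[term].include?(url)
    end
  end

  [inverted, freq]
end

# Assuming stop_words.txt holds one word per line, it could be loaded with:
#   stop_words = Set.new(File.readlines('stop_words.txt').map { |w| w.strip.downcase })
```

Passing the stop words in as a `Set` keeps each membership check cheap, which matters because the check runs once per token.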

After these files are created, you can run the search engine from the terminal: ruby search_engine.rb
This opens an interactive search engine that prompts the user to type a query and displays the top N results. N defaults to 3 and can be changed by setting the NUM_TIMES variable in 'search_engine.rb' to a different number. To quit the search engine, type the word "Quit".
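The prompt loop described above might look roughly like the sketch below. It is an assumption-laden illustration, not the contents of search_engine.rb: `NUM_RESULTS` stands in for NUM_TIMES, the names `top_results` and `run_prompt` are hypothetical, the exact-match check on "Quit" is a guess at the real behavior, and scoring is delegated to a block so any similarity function can be plugged in.

```ruby
# Stand-in for the NUM_TIMES setting mentioned in the README (assumed meaning).
NUM_RESULTS = 3

# Return the n URLs with the highest scores.
# scores - Hash of url => numeric relevance score
def top_results(scores, n)
  scores.sort_by { |_url, score| -score }.first(n).map { |url, _score| url }
end

# Read queries from `input` until end of input or the word "Quit",
# printing the top n results for each query to `output`.
# The block receives the query string and must return url => score.
def run_prompt(input, output, n = NUM_RESULTS)
  loop do
    output.print "Query: "
    line = input.gets
    break if line.nil? || line.strip == "Quit"

    scores = yield(line.strip)
    top_results(scores, n).each_with_index do |url, i|
      output.puts "#{i + 1}. #{url}"
    end
  end
end

# Example wiring (score_query is a hypothetical scoring method):
#   run_prompt($stdin, $stdout) { |query| score_query(query) }
```

Taking the input and output streams as parameters keeps the loop easy to exercise without a live terminal.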
