Skip to content

Latest commit

 

History

History
35 lines (32 loc) · 2.15 KB

README.md

File metadata and controls

35 lines (32 loc) · 2.15 KB

Erik's Web Crawler and Search Engine

A ruby based web crawler and search engine that pull information from lyle.smu.edu/~fmoore/ and creates an inverted index, a word frequency list, a list of pages, and all links. The Search Engine loads the text from recorded pages provided by tokens.txt and loads them into tf-idf-similarity Objects that are able to form cosine similarity matrices with other objects in order to remove duplicate pages, and return the correct results to the user.

##Software Software Used:

  • Ruby version 1.9.3
  • Mechanize Gem
  • Nokogiri Gem
  • tf-idf-similarity Gem

##Installation This web crawler uses ruby version 1.9.3. I use RVM to control which verion of ruby to use during a program. There are other ways to install ruby, however the steps below assume rvm is installed

To install the correct ruby version run rvm install ruby-1.9.3
Next install the mechanize gem, which directs you to the correct web page, and Nokogiri, the html parser used as the web crawler
gem install mechanize
gem install tf-idf-similarity
gem install nokogiri

##Use To use the web crawler, run from terminal: ruby crawler.rb
Stop words are included in a file called stop_words.txt. Running the web crawler will produce several .txt files:

  • pages.txt -> the pages visited by the web crawler
  • links.txt -> all links discovered by the crawler
  • broken_links.txt -> all broken links in site
  • tokens.txt -> a list of urls with their respective document text, This is then loaded in the search engine to create the tf-idf matrix
  • freq.txt -> list of terms and their frequency

After these pages are created you may then run the search engine from the terminal by running: ruby search_engine.rb
This will open up an interactive search engine that prompts the user to type a query and displays the top N results. The default is 3 but can be changed by changing the NUM_TIMES variable to equal a different number in 'search_engine.rb'. To quit the search engine simply type the word "Quit"