Erik's Web Crawler and Search Engine

A Ruby-based web crawler and search engine that pulls information from lyle.smu.edu/~fmoore/ and creates an inverted index, a word frequency list, a list of pages, and a list of all links. The search engine loads the text of the recorded pages from tokens.txt into tf-idf-similarity objects, which can form cosine similarity matrices with one another. These matrices are used to remove duplicate pages and to return the most relevant results to the user.
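To illustrate the duplicate-removal idea, here is a minimal pure-Ruby sketch of tf-idf vectors and cosine similarity. It does not use the tf-idf-similarity gem and is not taken from search_engine.rb. The helper names, the smoothed idf formula, and the 0.95 threshold are all assumptions chosen for illustration.

```ruby
# Split text into lowercase alphanumeric tokens.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Build one sparse tf-idf vector (term => weight) per document.
# Uses a smoothed idf, log((1 + N) / (1 + df)) + 1, which is an assumption.
def tfidf_vectors(docs)
  token_lists = docs.map { |doc| tokenize(doc) }
  doc_freq = Hash.new(0)
  token_lists.each { |tokens| tokens.uniq.each { |t| doc_freq[t] += 1 } }
  n = docs.size.to_f
  token_lists.map do |tokens|
    counts = Hash.new(0)
    tokens.each { |t| counts[t] += 1 }
    vec = {}
    counts.each { |t, c| vec[t] = c * (Math.log((1 + n) / (1 + doc_freq[t])) + 1) }
    vec
  end
end

# Cosine similarity between two sparse vectors.
def cosine(a, b)
  dot = a.inject(0.0) { |sum, (term, w)| sum + w * b.fetch(term, 0.0) }
  norm = lambda { |v| Math.sqrt(v.values.inject(0.0) { |s, w| s + w * w }) }
  denom = norm.call(a) * norm.call(b)
  denom.zero? ? 0.0 : dot / denom
end

# Return index pairs [i, j] whose similarity meets the (hypothetical) threshold.
def duplicate_pairs(docs, threshold = 0.95)
  vecs = tfidf_vectors(docs)
  pairs = []
  vecs.each_index do |i|
    ((i + 1)...vecs.size).each do |j|
      pairs << [i, j] if cosine(vecs[i], vecs[j]) >= threshold
    end
  end
  pairs
end
```

Pages whose pair shows up in `duplicate_pairs` could be collapsed to a single entry before results are shown.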

## Software

Software used:

Ruby version 1.9.3
Mechanize Gem
Nokogiri Gem
tf-idf-similarity Gem

## Installation

This web crawler uses Ruby version 1.9.3. I use RVM to control which version of Ruby a program runs with. There are other ways to install Ruby; however, the steps below assume RVM is installed.

To install the correct Ruby version, run: rvm install ruby-1.9.3
Next, install the Mechanize gem, which navigates to and fetches each web page, the Nokogiri gem, the HTML parser the crawler uses, and the tf-idf-similarity gem:
gem install mechanize
gem install tf-idf-similarity
gem install nokogiri

## Use

To use the web crawler, run from the terminal: ruby crawler.rb
Stop words are included in a file called stop_words.txt. Running the web crawler will produce several .txt files:

pages.txt -> the pages visited by the web crawler
links.txt -> all links discovered by the crawler
broken_links.txt -> all broken links found on the site
tokens.txt -> a list of URLs with their respective document text; this is then loaded by the search engine to create the tf-idf matrix
freq.txt -> list of terms and their frequency
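As a rough illustration of how a term frequency list and an inverted index can be built while skipping stop words, consider the sketch below. It is not the code in crawler.rb. The function name, the page names, the sample text, and the in-memory stop word set are hypothetical, and the actual formats of stop_words.txt and freq.txt are not shown here.

```ruby
require 'set'

# Build a term frequency table and an inverted index (term => list of URLs)
# from a hash of url => page text, ignoring any term in stop_words.
def index_pages(pages, stop_words)
  freq = Hash.new(0)
  inverted = Hash.new { |hash, term| hash[term] = [] }
  pages.each do |url, text|
    text.downcase.scan(/[a-z0-9]+/).each do |term|
      next if stop_words.include?(term)
      freq[term] += 1
      inverted[term] << url unless inverted[term].include?(url)
    end
  end
  [freq, inverted]
end

# Hypothetical sample data, for illustration only.
pages = {
  'page1.html' => 'The crawler visits the page',
  'page2.html' => 'The search engine ranks the page'
}
stop_words = Set.new(%w[the a an])
freq, inverted = index_pages(pages, stop_words)
```

With this sample data, `freq` counts each remaining term across both pages and `inverted` maps each term to the pages that contain it.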

After these files are created, you can run the search engine from the terminal: ruby search_engine.rb
This opens an interactive search engine that prompts the user for a query and displays the top N results. N defaults to 3 and can be changed by setting the NUM_TIMES variable in 'search_engine.rb' to a different number. To quit the search engine, type the word "Quit".
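The prompt-and-rank behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of search_engine.rb: only the name NUM_TIMES and its default of 3 come from this README, while `top_results`, `search_loop`, the case-insensitive handling of "Quit", and the made-up scores are hypothetical.

```ruby
require 'stringio'

NUM_TIMES = 3 # name and default value as described in this README

# Return the URLs of the n highest-scoring pages, best first.
def top_results(scores, n = NUM_TIMES)
  scores.sort_by { |_url, score| -score }.first(n).map { |url, _score| url }
end

# Read queries until "quit" (any case, an assumption) or end of input.
# The block is a hypothetical scorer that maps a query to { url => score }.
def search_loop(input = $stdin, output = $stdout)
  loop do
    output.print 'Query: '
    line = input.gets
    break if line.nil? || line.strip.casecmp('quit').zero?
    top_results(yield(line.strip)).each { |url| output.puts url }
  end
end

# Non-interactive demonstration with canned input and made-up scores.
fake_scores = { 'a.html' => 0.2, 'b.html' => 0.9, 'c.html' => 0.5, 'd.html' => 0.1 }
out = StringIO.new
search_loop(StringIO.new("ruby\nQuit\n"), out) { |_query| fake_scores }
```

Passing the input and output streams as arguments keeps the loop testable without a terminal; calling `search_loop` with no arguments would read from the keyboard instead.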
