GameDisplayer/custom-wiki-search-engine

Project B: Customized Wikipedia search engine

Goal: create a search engine for Wiki content

  • search documents based on their content
  • the ranking must take into account both the document text and the topic content (e.g. words common to documents of the same topic)
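
One possible way to combine the two signals is sketched below, assuming the text score comes from the search engine and the topic profile is a set of characteristic words. The class `TopicAwareScoreSketch`, its method names and the `topicWeight` parameter are hypothetical and are not taken from the project code.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: boosts a text relevance score with a topic signal. */
final class TopicAwareScoreSketch {

    /** Fraction of query terms that appear in the topic profile (0.0 for an empty query). */
    static double topicOverlap(List<String> queryTerms, Set<String> topicProfile) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String term : queryTerms) {
            if (topicProfile.contains(term)) {
                hits++;
            }
        }
        return (double) hits / queryTerms.size();
    }

    /** Multiplies the text score by a factor between 1 and (1 + topicWeight). */
    static double combinedScore(double textScore, double overlap, double topicWeight) {
        return textScore * (1.0 + topicWeight * overlap);
    }

    public static void main(String[] args) {
        // Assumed inputs: a tokenized query and a tiny, made-up "history" profile.
        List<String> query = List.of("roman", "empire");
        Set<String> historyProfile = Set.of("empire", "war");
        double overlap = topicOverlap(query, historyProfile);
        System.out.println(combinedScore(2.0, overlap, 1.0));
    }
}
```

With `topicWeight` set to 0 the ranking falls back to the plain text score, which makes the topic contribution easy to switch off when comparing results.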

Given steps:

  • Create a text database
  • Tokenize and index the documents
  • Create the topic profile
  • Process basic searches
  • Process advanced searches
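
To make the "tokenize, index, search" steps above more concrete, here is a minimal sketch of an inverted index using only the Java standard library. The project itself relies on Lucene for this part, so the class `InvertedIndexSketch`, its methods and the sample sentences are illustrative assumptions rather than the project's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index; a stand-in for what Lucene does at a much larger scale. */
final class InvertedIndexSketch {
    // term -> sorted set of the ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Lowercases the text and splits it on anything that is not a-z or 0-9. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Indexes one document under the given (hypothetical) numeric id. */
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Basic AND search: ids of the documents that contain every query term. */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The French Revolution began in 1789.");
        index.add(2, "Quantum physics describes matter and energy.");
        index.add(3, "The Industrial Revolution changed energy use.");
        System.out.println(index.search("revolution"));        // prints [1, 3]
        System.out.println(index.search("energy revolution")); // prints [3]
    }
}
```

This sketch only answers whether a document matches. Scoring, stemming and phrase queries are the parts a real engine such as Lucene adds on top.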

Database:

Our text database is composed of 1,518 Wikipedia pages extracted from 3 main categories:

  1. History and events
  2. Natural and physical science
  3. Religions and belief systems

Project structure:

  • src/main/java
    1. Application.java: GUI of the project
    2. BarChart.java: a simple bar chart built from the topic profile (not essential)
    3. Main.java: the indexing and search part of the search engine
    4. Statistics.java: a class with functions that give more information on the corpus and its terms
    5. TopicModeling.java: topic extraction and topic profiles are computed here (the most relevant words are calculated)
  • src/main/resources
    1. Icon: folder containing the icons for the GUI
    2. topics_folder: folder with ranked words per topic, based on different metrics (TF, IDF and TF-IDF): topic_occurences.txt, topic_idfs.txt and topic_tfidfs.txt (with topic = history, religion or sciences)
    3. wordnet_prolog: folder with WordNet files for advanced synonym searches
    4. english_stopwords.txt: text file with common stop words (and other uninformative words), used for the topic profile
    5. WikiData.XML & WikiData.CSV: the documents in two file formats (for ease of use)
    6. WikiDumpXMLtoCSV.py: Python script that is run automatically when the Maven project is launched; it transforms the WikiData XML into CSV
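
To illustrate how words can be ranked by TF, IDF and TF-IDF after stop-word removal, here is a minimal sketch in plain Java. It assumes raw occurrence counts for TF and ln(N / df) for IDF over one small document collection; the exact formulas used in TopicModeling.java may differ, and the class `TopicProfileSketch`, its methods and the sample documents are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a topic profile: terms ranked by TF * IDF. */
final class TopicProfileSketch {

    /** TF: how often each non-stop-word term occurs across all documents. */
    static Map<String, Integer> termFrequencies(List<List<String>> docs, Set<String> stopWords) {
        Map<String, Integer> tf = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                if (!stopWords.contains(term)) {
                    tf.merge(term, 1, Integer::sum);
                }
            }
        }
        return tf;
    }

    /** IDF: ln(N / df), where df is the number of documents containing the term. */
    static Map<String, Double> inverseDocumentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // count each term once per document
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    /** The k terms with the highest TF * IDF, best first; ties are broken alphabetically. */
    static List<String> topTerms(List<List<String>> docs, Set<String> stopWords, int k) {
        Map<String, Integer> tf = termFrequencies(docs, stopWords);
        Map<String, Double> idf = inverseDocumentFrequencies(docs);
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            score.put(e.getKey(), e.getValue() * idf.get(e.getKey()));
        }
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> score.get(t)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }

    public static void main(String[] args) {
        // Made-up, already tokenized documents.
        List<List<String>> docs = List.of(
                List.of("the", "pharaoh", "ruled", "egypt", "the", "pharaoh"),
                List.of("the", "nile", "floods", "egypt"),
                List.of("the", "temple"));
        System.out.println(topTerms(docs, Set.of("the"), 2)); // prints [pharaoh, floods]
    }
}
```

In this toy collection "egypt" occurs as often as "pharaoh" but appears in two of the three documents, so its lower IDF pushes it down the ranking. That is the effect the TF-IDF files in topics_folder are meant to capture.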

How to launch the project?

It is as simple as this:

$ mvn clean
$ mvn compile
$ mvn exec:java

What else do I need to know?

TopicModeling and Statistics can also be run standalone, directly through their own main functions.

APIs used and their purpose:

  • Lucene - for the search engine part (indexing, tokenization, search...)
  • Stanford CoreNLP - for the topic profile part (analysis of the documents)
  • OpenCSV - for parsing the WikiData CSV file in order to index the documents
  • WikiDumpReader - the parser that transforms the WikiData XML into the WikiData CSV (requires Python 3)