- Search documents based on their content
- The ranking must take into account both the document text and the topic content (e.g. words that are common across documents of the same topic) — a minimal sketch of this is given after the step list below
- Create a text database
- Tokenize and index the documents
- Create the topic profile
- Process basic searches
- Process advanced searches
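To make these steps concrete, here is a minimal, self-contained sketch of indexing and topic-aware searching with Lucene. It is not the project's actual code: the field names, sample documents, hypothetical topic words and the Lucene version (8.x) are assumptions; in the real engine the boosted topic words come from the topic profile files.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SearchSketch {

    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory(); // in-memory index, enough for a sketch

        // 1) Tokenize and index the documents ("title" and "content" are assumed field names).
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            addDoc(writer, "French Revolution", "The French Revolution was a period of political upheaval ...");
            addDoc(writer, "Photosynthesis", "Photosynthesis is the process by which plants convert light ...");
        }

        // 2) Search: combine the user query with a lightly boosted query over the topic
        //    profile words, so documents sharing the topic vocabulary rank higher.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("content", analyzer);
            Query userQuery = parser.parse("revolution");
            Query topicQuery = parser.parse("war king empire century"); // hypothetical "history" profile words
            Query combined = new BooleanQuery.Builder()
                    .add(userQuery, BooleanClause.Occur.MUST)
                    .add(new BoostQuery(topicQuery, 0.3f), BooleanClause.Occur.SHOULD)
                    .build();
            for (ScoreDoc hit : searcher.search(combined, 10).scoreDocs) {
                System.out.printf("%.3f  %s%n", hit.score, searcher.doc(hit.doc).get("title"));
            }
        }
    }

    private static void addDoc(IndexWriter writer, String title, String content) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("title", title, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.YES));
        writer.addDocument(doc);
    }
}
```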
Our text database is composed of 1,518 Wikipedia pages extracted from 3 main categories:
- History and events
- Natural and physical science
- Religions and belief systems
- src/main/java
- Application.java : GUI of the project
- BarChart.java : a simple bar chart built from the topic profile (not essential to the search engine)
- Main.java : the indexing and search part of the search engine
- Statistics.java : a class with functions that give you more information on the corpus and terms
- TopicModeling.java : topic extraction and topic profile construction are done here (the most relevant words per topic are computed)
- src/main/resources
- Icon : folder containing icons for the GUI
- topics_folder : folder with words ranked per topic according to different metrics (TF, IDF & TF-IDF): topic_occurences.txt, topic_idfs.txt and topic_tfidfs.txt (with topic = history, religion or sciences); a simplified sketch of this ranking is given after this list
- wordnet_prolog : folder with Wordnet files for advanced synonym searches
- english_stopwords.txt : text file with common stop words (and other uninformative words) used when building the topic profile
- WikiData.XML & WikiData.CSV : the documents in two file formats (for ease of use)
- WikiDumpXMLtoCSV.py : Python script automatically executed when the Maven project is launched; it converts the WikiData XML into CSV
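The actual computation lives in TopicModeling.java and relies on Stanford CoreNLP, but the idea behind the topic_*.txt files can be illustrated with a short, simplified sketch (assumed class and method names, naive tokenization instead of CoreNLP): TF is counted inside the documents of one topic, IDF over the whole corpus, and their product ranks the topic's most relevant words.

```java
import java.util.*;

public class TopicProfileSketch {

    /** Ranks a topic's vocabulary by TF-IDF: TF counted inside the topic's documents,
     *  IDF computed over the whole corpus; stop words are ignored. */
    public static LinkedHashMap<String, Double> rankTerms(List<String> topicDocs,
                                                          List<String> allDocs,
                                                          Set<String> stopWords) {
        // Term frequency inside the topic.
        Map<String, Integer> tf = new HashMap<>();
        for (String doc : topicDocs) {
            for (String token : tokenize(doc)) {
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    tf.merge(token, 1, Integer::sum);
                }
            }
        }

        // Document frequency over the whole corpus, used for IDF.
        Map<String, Integer> df = new HashMap<>();
        for (String doc : allDocs) {
            for (String token : new HashSet<>(tokenize(doc))) {
                df.merge(token, 1, Integer::sum);
            }
        }

        // Score each topic term, then keep the terms in descending TF-IDF order.
        Map<String, Double> scores = new HashMap<>();
        int n = allDocs.size();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) n / df.getOrDefault(e.getKey(), 1));
            scores.put(e.getKey(), e.getValue() * idf);
        }

        LinkedHashMap<String, Double> ranked = new LinkedHashMap<>();
        scores.entrySet().stream()
              .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
              .forEach(e -> ranked.put(e.getKey(), e.getValue()));
        return ranked;
    }

    private static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\W+"));
    }
}
```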
As simple as that!
$ mvn clean
$ mvn compile
$ mvn exec:java
It is also possible to run TopicModeling and Statistics as standalone programs through their own main methods.
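For example, with the exec-maven-plugin you can point exec.mainClass at either class (assuming the classes sit in the default package; use the fully qualified name otherwise):

$ mvn exec:java -Dexec.mainClass="TopicModeling"
$ mvn exec:java -Dexec.mainClass="Statistics"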
- Lucene - for the search engine part (indexing, tokenization, search...)
- Stanford CoreNLP - for the topic profile part (linguistic analysis of the documents)
- OpenCSV - for parsing the WikiData CSV file before indexing the documents
- WikiDumpReader - the parser that transforms WikiData XML into WikiData CSV - requires Python 3