- Search documents based on their content
- The ranking must take into account both the document text and the topic content (e.g. words that are common across documents of the same topic) — a minimal sketch of this is given after the step list below
- Create a text database
- Tokenize and index the documents
- Create the topic profile
- Process basic searches
- Process advanced searches
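To make these steps concrete, here is a minimal, self-contained sketch of indexing and topic-aware searching with Lucene. It is not the project's actual code: the field names, sample documents, hypothetical topic words and the Lucene version (8.x) are assumptions; in the real engine the boosted topic words come from the topic profile files.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SearchSketch {

    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory(); // in-memory index, enough for a sketch

        // 1) Tokenize and index the documents ("title" and "content" are assumed field names).
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            addDoc(writer, "French Revolution", "The French Revolution was a period of political upheaval ...");
            addDoc(writer, "Photosynthesis", "Photosynthesis is the process by which plants convert light ...");
        }

        // 2) Search: combine the user query with a lightly boosted query over the topic
        //    profile words, so documents sharing the topic vocabulary rank higher.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("content", analyzer);
            Query userQuery = parser.parse("revolution");
            Query topicQuery = parser.parse("war king empire century"); // hypothetical "history" profile words
            Query combined = new BooleanQuery.Builder()
                    .add(userQuery, BooleanClause.Occur.MUST)
                    .add(new BoostQuery(topicQuery, 0.3f), BooleanClause.Occur.SHOULD)
                    .build();
            for (ScoreDoc hit : searcher.search(combined, 10).scoreDocs) {
                System.out.printf("%.3f  %s%n", hit.score, searcher.doc(hit.doc).get("title"));
            }
        }
    }

    private static void addDoc(IndexWriter writer, String title, String content) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("title", title, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.YES));
        writer.addDocument(doc);
    }
}
```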
Our text database is composed of 1,518 Wikipedia pages extracted from 3 main categories:
- History and events
- Natural and physical science
- Religions and belief systems
- src/main/java
- Application.java : GUI of the project
- BarChart.java : a simple bar chart built from the topic profile (not essential to the search engine)
- Main.java : the indexing and search part of the search engine
- Statistics.java : a class with functions that give you more information on the corpus and terms
- TopicModeling.java : topic extraction and topic profile construction are done here (the most relevant words per topic are computed)
- src/main/resources
- Icon : folder containing icons for the GUI
- topics_folder : folder with words ranked per topic according to different metrics (TF, IDF & TF-IDF): topic_occurences.txt, topic_idfs.txt and topic_tfidfs.txt (with topic = history, religion or sciences); a simplified sketch of this ranking is given after this list
- wordnet_prolog : folder with Wordnet files for advanced synonym searches
- english_stopwords.txt : text file with common stop words (and other uninformative words) used when building the topic profile
- WikiData.XML & WikiData.CSV : the documents in two file formats (for ease of use)
- WikiDumpXMLtoCSV.py : Python script automatically executed when the Maven project is launched; it converts the WikiData XML into CSV
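The actual computation lives in TopicModeling.java and relies on Stanford CoreNLP, but the idea behind the topic_*.txt files can be illustrated with a short, simplified sketch (assumed class and method names, naive tokenization instead of CoreNLP): TF is counted inside the documents of one topic, IDF over the whole corpus, and their product ranks the topic's most relevant words.

```java
import java.util.*;

public class TopicProfileSketch {

    /** Ranks a topic's vocabulary by TF-IDF: TF counted inside the topic's documents,
     *  IDF computed over the whole corpus; stop words are ignored. */
    public static LinkedHashMap<String, Double> rankTerms(List<String> topicDocs,
                                                          List<String> allDocs,
                                                          Set<String> stopWords) {
        // Term frequency inside the topic.
        Map<String, Integer> tf = new HashMap<>();
        for (String doc : topicDocs) {
            for (String token : tokenize(doc)) {
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    tf.merge(token, 1, Integer::sum);
                }
            }
        }

        // Document frequency over the whole corpus, used for IDF.
        Map<String, Integer> df = new HashMap<>();
        for (String doc : allDocs) {
            for (String token : new HashSet<>(tokenize(doc))) {
                df.merge(token, 1, Integer::sum);
            }
        }

        // Score each topic term, then keep the terms in descending TF-IDF order.
        Map<String, Double> scores = new HashMap<>();
        int n = allDocs.size();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) n / df.getOrDefault(e.getKey(), 1));
            scores.put(e.getKey(), e.getValue() * idf);
        }

        LinkedHashMap<String, Double> ranked = new LinkedHashMap<>();
        scores.entrySet().stream()
              .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
              .forEach(e -> ranked.put(e.getKey(), e.getValue()));
        return ranked;
    }

    private static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\W+"));
    }
}
```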
As simple as that!
$ mvn clean
$ mvn compile
$ mvn exec:java
It is also possible to run TopicModeling and Statistics as standalone programs through their own main methods.
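For example, with the exec-maven-plugin you can point exec.mainClass at either class (assuming the classes sit in the default package; use the fully qualified name otherwise):

$ mvn exec:java -Dexec.mainClass="TopicModeling"
$ mvn exec:java -Dexec.mainClass="Statistics"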
- Lucene - for the search engine part (indexing, tokenization, search...)
- Stanford CoreNLP - for the topic profile part (linguistic analysis of the documents)
- OpenCSV - for parsing the WikiData CSV file before indexing the documents
- WikiDumpReader - the parser that transforms WikiData XML into WikiData CSV - requires Python 3