This repository contains programs, written in various programming languages, that build bag-of-words models from a Wikipedia database dump (a single large XML file). The aim is to support (not completely as of now, but in the future):
- Conditions on the bag-of-words, such as a minimum count for each word within a document (see the sketch after this list)
- Documents in English and in Japanese
- TF-IDF weighting and normalization
- Concurrent parsing using threads (a threaded sketch follows the XML example below)
- A web interface for checking intermediate reports, e.g. throughput
- Test programs to check that the program produces correct results with or without threads (under the `share` directory)
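As a rough illustration of the minimum-count condition and the TF-IDF step, here is a minimal Python sketch. It is not the repository's actual code; all function names and the example documents are illustrative.

```python
import math
from collections import Counter

def bag_of_words(tokens, min_count=1):
    """Count tokens in one document, dropping words seen fewer than min_count times."""
    counts = Counter(tokens)
    return {w: c for w, c in counts.items() if c >= min_count}

def tf_idf(docs):
    """docs: list of bag-of-words dicts; returns L2-normalized TF-IDF vectors."""
    n = len(docs)
    df = Counter(w for doc in docs for w in doc)  # document frequency of each word
    vectors = []
    for doc in docs:
        vec = {w: c * math.log(n / df[w]) for w, c in doc.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0  # avoid division by zero
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors

docs = [
    bag_of_words("the quick brown fox jumps over the lazy dog".split()),
    bag_of_words("the lazy dog sleeps".split()),
]
print(tf_idf(docs))
```

Many IDF variants exist (e.g. the smoothed `log(n / (1 + df)) + 1`); the plain form above is used only for brevity.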
The longer-term hope is to provide these implementations as general-purpose modules in each language, not tied specifically to the Wikipedia database. In that sense, this repository is an ecosystem for trying out and learning implementations across languages.

Each article in the dump is stored as a `<page>` element, roughly of this shape:
```xml
<page>
  <title> ... </title>
  <text ...>
    HERE IS THE CONTENTS
  </text>
</page>
...
```
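The following Python sketch shows one way such a dump might be streamed with `xml.etree.ElementTree.iterparse` and tokenized concurrently on a thread pool. It assumes the flat `<page>`/`<title>`/`<text>` layout shown above (real Wikipedia dumps additionally wrap `<text>` in a `<revision>` element and use an XML namespace); the file name `dump.xml` and all function names are illustrative.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

def iter_pages(path):
    """Stream (title, text) pairs without loading the whole dump into memory."""
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title", default=""), elem.findtext("text", default="")
            elem.clear()  # release the subtree we have already consumed

def tokenize(page):
    """Very naive tokenizer; a real one would handle wiki markup and Japanese."""
    title, text = page
    return title, text.lower().split()

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        for title, tokens in pool.map(tokenize, iter_pages("dump.xml")):
            print(title, len(tokens))
```

Note that in CPython the global interpreter lock limits the speedup threads can give to CPU-bound tokenization; languages without that constraint behave differently here, which is part of the point of comparing implementations across languages.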