Large Bangla Web Crawl

Description

A corpus of 3 million web pages crawled from Bangladeshi sites. It contains 180 million Bangla words, of which 500,000 are unique. The sources include all news portals, blogs, and some other popular Bangladeshi sites. Researchers can use it to find trends, analyze how a particular event was covered, or train AI-powered systems with unsupervised learning, among other uses.
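As a rough illustration of how word statistics like the ones above might be computed, the following sketch counts total and unique Bangla words in a collection of text strings. It is not the method used to produce the published figures. The function name `bangla_word_stats` and the tokenization rule (any run of characters from the Bengali Unicode block, U+0980 to U+09FF) are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a "Bangla word" is any unbroken run of characters from the
# Bengali Unicode block (U+0980 to U+09FF). This also counts Bangla digits
# as words and splits words at zero-width joiners, so it is only approximate.
BANGLA_WORD = re.compile(r"[\u0980-\u09FF]+")


def bangla_word_stats(texts):
    """Return (total_words, unique_words) over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(BANGLA_WORD.findall(text))
    return sum(counts.values()), len(counts)


if __name__ == "__main__":
    pages = ["আমি বাংলা ভালোবাসি।", "বাংলা news বাংলা"]
    total, unique = bangla_word_stats(pages)
    print(f"total={total} unique={unique}")
```

Non-Bangla tokens such as Latin-script words and the danda (।) fall outside the Bengali block and are ignored by this rule.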

Data Format

The corpus is released in the WARC (Web ARChive) format used by Common Crawl, so researchers can process it with the publicly available tools that already support Common Crawl WARC files.
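To give a sense of what reading the data involves, here is a minimal sketch of a WARC reader that uses only the Python standard library. It assumes an uncompressed WARC file in which each record has a `WARC/` version line, header fields, a blank line, and a body of `Content-Length` bytes. The function names `iter_warc_records` and `response_uris` are hypothetical and not part of this repository. For real workloads, an established WARC library is likely a more robust choice than this hand-written parser.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream.

    `stream` must be a binary file-like object. Header names are lowercased.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line (or end of file) ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length is measured in bytes, not characters.
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def response_uris(stream):
    """Return the target URIs of all 'response' records in the stream."""
    return [
        headers.get("warc-target-uri")
        for headers, _ in iter_warc_records(stream)
        if headers.get("warc-type") == "response"
    ]


if __name__ == "__main__":
    # A tiny in-memory record stands in for a real corpus file.
    payload = "বাংলা".encode("utf-8")
    record = (
        b"WARC/1.0\r\nWARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        + f"Content-Length: {len(payload)}\r\n\r\n".encode("ascii")
        + payload
        + b"\r\n\r\n"
    )
    print(response_uris(io.BytesIO(record)))
```

To run it on a file from the corpus, open the file in binary mode (`open(path, "rb")`) and pass the file object to `iter_warc_records`.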

Size

The corpus is currently 150 GB in size and is constantly growing. We have set up a distributed web crawler (using Apache Nutch) that continuously crawls popular Bangla sites, stores the pages in MongoDB, and indexes them in Elasticsearch.

How To Get The Full Version

Due to its large size, only a 7 MB sample is available on GitHub. If you need the full version, we can arrange a way to send the dataset to you. Please email contact@socian.ai.

License

The corpus is licensed under the GNU GPLv3, so anyone may use the data for any purpose, subject to the terms of that license.
