socian-ai/socian-large-bangla-web-crawl

Large Bangla Web Crawl

Description

A corpus of 3 million web pages crawled from Bangladeshi sites. It contains 180 million Bangla words, of which 500,000 are unique. Sources include news portals, blogs, and some other popular Bangladeshi sites. Researchers can use it to find trends, analyze the coverage of particular events, or train machine learning systems with unsupervised methods, among other uses.
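As a rough illustration of how total and unique word counts like those above could be computed from extracted page text, here is a minimal sketch. It is not the method used to produce the published figures. The function name `bangla_word_stats` and the tokenization rule (a word is a run of characters from the Unicode Bengali block) are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a "Bangla word" is a maximal run of characters from the
# Unicode Bengali block (U+0980 to U+09FF). This is an approximation:
# Bengali digits also fall in this block and are counted as words.
BANGLA_WORD = re.compile(r"[\u0980-\u09FF]+")


def bangla_word_stats(texts):
    """Return (total_words, unique_words) over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(BANGLA_WORD.findall(text))
    return sum(counts.values()), len(counts)


# Hypothetical usage on one short string of mixed Bangla and Latin text.
total, unique = bangla_word_stats(["আমি ভাত খাই। আমি hello 123"])
```

A real pipeline would first strip HTML from each page and would probably need a more careful tokenizer, but the counting step would look similar.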

Data Format

The corpus is released in the WARC format used by Common Crawl, so researchers can process it with existing, publicly available WARC tools.
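To give a feel for the format, here is a minimal sketch of reading WARC records with only the Python standard library. It assumes well-formed records (a `WARC/` version line, header lines, a blank line, then `Content-Length` bytes of payload). The helper names `iter_warc_records`, `open_warc`, and `response_urls` are made up for this example. For real work, an established WARC library such as `warcio` is likely a better choice.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def open_warc(path):
    """Open a WARC file for reading, handling gzip (.warc.gz) transparently."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rb")
    return open(path, "rb")


def response_urls(path):
    """Return the target URI of every 'response' record in the file."""
    with open_warc(path) as stream:
        return [
            headers.get("warc-target-uri")
            for headers, _ in iter_warc_records(stream)
            if headers.get("warc-type") == "response"
        ]


# In-memory demonstration with one hand-written record (not taken from the corpus).
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
)
records = list(iter_warc_records(io.BytesIO(sample)))
```

For a `response` record, the payload is the raw HTTP response (status line, HTTP headers, and body), so a further parsing step is needed to reach the page HTML.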

Size

The corpus is currently 150 GB in size and is constantly growing. We have set up a distributed web crawler (using Apache Nutch) that continuously crawls popular Bangla sites, stores the pages in MongoDB, and indexes them in Elasticsearch.

How To Get The Full Version

Due to its large size, only a 7 MB sample is available on GitHub. If you need the full version, we can arrange a way to send the dataset to you. Please email contact@socian.ai.

License

The corpus is licensed under the GNU GPLv3, which allows anyone to use the data for any purpose under the terms of that license.
