Most published examples of using BERTopic analyze English texts. Modern English uses spaces to separate words, but languages such as Chinese and Japanese do not follow this convention. This difference matters a great deal when using BERTopic. In this project, I demonstrate how the results differ when generating topics from Japanese texts with and without a Japanese tokenizer.
Figure 1: Change in results with and without a Japanese tokenizer
Figure 2: Result without a Japanese tokenizer
Figure 3: Result with a Japanese tokenizer
- As Figure 1 shows, coupling BERTopic with a Japanese tokenizer significantly changes the topic representations. Without a Japanese tokenizer, the representations resemble complete sentences rather than individual words, producing a cluttered and disorganized topic word score chart. With a Japanese tokenizer, the results are far better suited to our objective of detecting and visualizing topics. Figure 2 and Figure 3 show the two outcomes, respectively. Incorporating a Japanese tokenizer therefore improves the topic representations and makes them more useful for topic identification and visualization (a minimal code sketch of this setup follows below).
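For concreteness, here is a rough sketch of how a Japanese tokenizer can be plugged into BERTopic through scikit-learn's `CountVectorizer`. It assumes the Janome tokenizer purely for illustration; the exact tokenizer, embedding model, and parameters used in this project may differ.

```python
# Minimal sketch: word-level topic representations for Japanese text.
# Janome is assumed here only as an illustrative tokenizer
# (pip install bertopic janome); any Japanese tokenizer can be used.
from bertopic import BERTopic
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer

janome = Tokenizer()

def tokenize_ja(text):
    # Split Japanese text into surface-form tokens instead of relying
    # on whitespace, which Japanese does not use between words.
    return [token.surface for token in janome.tokenize(text)]

# Without a custom tokenizer, CountVectorizer splits on whitespace and
# punctuation, so whole Japanese sentences end up as single "words".
vectorizer = CountVectorizer(tokenizer=tokenize_ja)

topic_model = BERTopic(
    language="multilingual",      # use a multilingual embedding model
    vectorizer_model=vectorizer,  # word-level topic representations
)

# docs: a list of Japanese documents; a real corpus of hundreds of
# texts or more is needed for meaningful clusters.
# topics, probs = topic_model.fit_transform(docs)
```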
References I consulted for this project:
- snscrape, a social networking service scraper: https://github.com/JustAnotherArchivist/snscrape
- "Using BERTopic to Analyze Qatar World Cup Twitter Data": https://medium.com/@cd_24/using-bertopic-to-analyze-qatar-world-cup-twitter-data-a5956c4949f1
I adopted the list of Japanese stop words from the following:
https://github.com/stopwords-iso/stopwords-ja
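Assuming the list has been saved locally (e.g., as stopwords-ja.txt from the raw file in that repository; the filename here is just an illustration), it can be passed to the same `CountVectorizer`, roughly like this:

```python
# Minimal sketch: applying the stopwords-iso Japanese stop word list.
# Assumes the list was downloaded as stopwords-ja.txt (one word per line).
from sklearn.feature_extraction.text import CountVectorizer

with open("stopwords-ja.txt", encoding="utf-8") as f:
    japanese_stop_words = [line.strip() for line in f if line.strip()]

# Combine the stop words with the tokenizer sketched above so that
# high-frequency function words (の, は, が, ...) are excluded from
# the topic representations.
vectorizer = CountVectorizer(
    tokenizer=tokenize_ja,
    stop_words=japanese_stop_words,
)
```

Filtering out these function words keeps the topic word scores focused on content words rather than particles that appear in nearly every Japanese sentence.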
The code for this project is available in this repository. Feel free to check it out and use it.
If you are interested in learning more about how to build a Japanese tokenizer and why one is necessary when using BERTopic to analyze Japanese texts, I recommend reading my post on Medium. In the article, I delve into the importance of tokenization in natural language processing and explain how the unique characteristics of the Japanese language require a dedicated tokenizer for accurate analysis. I also outline the steps involved in building a Japanese tokenizer and provide detailed code snippets for reference.