Generating Word Cloud from Bengali Text using Python

Samin Yaser

4 minute read · Saturday, September 30, 2023


Bengali wordcloud
Figure: Bengali wordcloud

Introduction

A word cloud is a visual representation of a set of words. It is a useful tool for understanding the frequency of words in a body of text. Word clouds are often used to describe the contents of a dataset in NLP-related projects.

In this tutorial, we will generate a word cloud from a Bengali dataset using the wordcloud library in Python. We will be working with a dataset consisting of Wikipedia articles, most commonly known as the Wikipedia Dataset. It can be downloaded from here. You can also get it from here, but in that case you have to manually extract and clean the articles from the XML file. Maybe I'll write a tutorial on how to do that in the future.

Note that it is a big dataset, so the code in this tutorial is optimized for memory.

Importing Libraries

First, we import the necessary libraries.

import re
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
from collections import Counter

re, wordcloud and matplotlib are the libraries we need to generate and display the word cloud.

numpy and PIL are only needed if you want to generate the word cloud in a specific shape, so they can be omitted if you are not interested in that.

Counter is recommended if you have a large dataset and are running low on RAM.

Removing Stop Words

Stop words are words that add little to the meaning of the text, and they are usually the most frequently used words in it. So, generating a word cloud without removing them will not give us a meaningful result. Look at the figure below and see for yourself; the short snippet after it shows why this happens.

Bengali wordcloud with stop words
Figure: Bengali wordcloud with stop words
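
To make this concrete, here is a tiny made-up illustration (my own toy sentence, not from the dataset): even in one short sentence, counting the words with Counter puts the pronoun আমি ("I") at the top, and over a whole corpus such function words dwarf every content word.

from collections import Counter

# Toy sentence: "I eat rice and I go to school."
toy_text = 'আমি ভাত খাই এবং আমি স্কুলে যাই'

print(Counter(toy_text.split()).most_common(3))
# [('আমি', 2), ('ভাত', 1), ('খাই', 1)]  <- the stop word already dominates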

The easiest way to remove stop words is to use the bnlp-toolkit library. This is how you do it.

# Install bnlp-toolkit with this command
# pip install bnlp-toolkit

from bnlp.corpus import stopwords
from bnlp.corpus.util import remove_stopwords

raw_text = 'আমি ভাত খাই।'
result = remove_stopwords(raw_text, stopwords)
print(result)
# ['ভাত', 'খাই', '।']

Pretty easy, right? However, if you have your own list of stop words, you can do this to clean your dataset.

from tqdm import tqdm # Progress bar for the long-running loop below

with open('../Datasets/stopwords.txt', 'r', encoding='utf-8') as fin:
  stopwords = set(fin.read().splitlines()) # A set makes the membership checks much faster

with open('../Datasets/wiki/all_v2_clean_sentence_no_stopwords.txt', 'w', encoding='utf-8') as fout:
  with open('../Datasets/wiki/all_v2_clean_sentence.txt', 'r', encoding='utf-8') as fin:
    for line in tqdm(fin):
      words = line.split() # split() also strips the trailing newline
      clean_line = [word for word in words if word not in stopwords]
      fout.write(' '.join(clean_line) + '\n')

Here, we are removing stop words from the dataset line by line, which is a bit slow but saves memory if you have a large dataset. Feel free to do it all at once if your dataset is small.

Note that we are also saving the cleaned dataset in a new text file so that we don't have to repeat this time-consuming process again.
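
By the way, if you would rather reuse bnlp's built-in list in the loop above instead of maintaining your own stopwords.txt, here is a small sketch (assuming stopwords imports as a plain list of words, as in the earlier snippet):

from bnlp.corpus import stopwords

# Assumption: `stopwords` is a plain Python list of Bengali stop words.
# A set makes the `word not in stopwords` check in the loop much faster.
stopwords = set(stopwords)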

Getting the most frequent words

This step is optional if you have a small dataset. But if you have a large dataset, the word cloud generator might eat up all your memory and crash your PC. To avoid this, we use Counter from the collections library, which can count the frequency of each word in the dataset in a memory-friendly way. We can then use only the most frequent words to generate the word cloud. In my experience, this method is also faster than simply passing the whole dataset to the word cloud.

counter = Counter()

# Count word frequencies line by line instead of loading the whole file into memory at once
with open('../Datasets/wiki/all_v2_clean_sentence_no_stopwords.txt', 'r', encoding='utf-8') as fin:
  for line in fin:
    counter.update(line.split())

most_common_words = counter.most_common(2000) # Use the 2000 most common words to generate the word cloud

Generating the Word Cloud ☁️

Finally, we can generate the word cloud. We are also giving it the shape of the Bangladeshi flag.

Flag of Bangladesh
Figure: Flag of Bangladesh

I also suggest getting a Bengali font. I am using Nirmala UI, which is the default Bengali font on Windows.

regex = r"([\S]+)"
mask = np.array(Image.open("bd.jpg")) # Load your picture here

def plot_word_cloud(text):

    wordcloud = WordCloud(
        # Width and height of the word cloud image
        width = 1000,
        height = 500,

        mode='RGBA',
        background_color ='white',
        font_path='fonts/nirmala.ttf', # Use your own font here
        regexp=regex,
        collocations=True,
        mask=mask,

        # Maximum number of words in the word cloud
        max_words=2000,
    )

    wordcloud.generate(text)

    # plot the WordCloud image
    image_colors = ImageColorGenerator(mask)
    plt.figure(figsize = (15, 15))
    plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
    plt.axis("off")
    plt.tight_layout(pad = 0)

    plt.show()

plot_word_cloud(' '.join([i[0] for i in most_common_words]))

Word Cloud with Custom Shape
Figure: Word Cloud with Custom Shape

Passing the regexp=r"([\S]+)" argument to WordCloud is crucial. Otherwise, your output image might look something like this, where conjunct letters are broken.

Broken Conjunctive Letters
Figure: Broken Conjunctive Letters
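
One more thing: plot_word_cloud above receives the most common words joined into a single string, so every word appears exactly once and the relative sizes no longer reflect how often the words actually occur in the dataset. If you want the sizes to follow the real counts from Counter, the wordcloud library also offers generate_from_frequencies, which takes a dict mapping words to counts. Here is a minimal sketch (my own variation, reusing mask, the font and most_common_words from above):

frequencies = dict(most_common_words)  # {word: count}

wordcloud = WordCloud(
    width=1000,
    height=500,
    mode='RGBA',
    background_color='white',
    font_path='fonts/nirmala.ttf', # Use your own font here
    mask=mask,
    max_words=2000,
)
# No regexp is needed here: the words are already tokenized,
# so conjunct letters are not at risk of being broken up.
wordcloud.generate_from_frequencies(frequencies)

plt.figure(figsize=(15, 15))
plt.imshow(wordcloud.recolor(color_func=ImageColorGenerator(mask)), interpolation="bilinear")
plt.axis("off")
plt.show()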

Alternatives for Word Cloud

You might also want to check out this, where the author generated a bar chart of the most common words in his dataset. It may not be as good-looking as a word cloud, but if you want to go a more statistical route, this is the way to go.
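
For instance, here is a rough matplotlib sketch of that idea (my own, not from the linked post), plotting the top 20 entries of most_common_words and loading the same Bengali font so that the labels render properly:

from matplotlib.font_manager import FontProperties

bengali_font = FontProperties(fname='fonts/nirmala.ttf') # Same font file as before

top_words = most_common_words[:20]
words = [w for w, _ in top_words]
counts = [c for _, c in top_words]

plt.figure(figsize=(8, 8))
plt.barh(words[::-1], counts[::-1]) # Reverse so the most frequent word ends up on top
plt.yticks(fontproperties=bengali_font)
plt.xlabel('Frequency')
plt.title('Most frequent words')
plt.tight_layout()
plt.show()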

Conclusion

This is a good starting point for generating word clouds. I hope it helped you. Here's a GitHub gist with the code.