There is a mismatch between `output["topic-word-matrix"]` and `dataset.get_vocabulary()` in terms of words? #86

Zay-Ben · 2023-01-10T15:39:40Z

There is a mismatch between output["topic-word-matrix"] and dataset.get_vocabulary() in terms of words?

I created a DataFrame as follows:

df = pd.DataFrame(data = output["topic-word-matrix"], columns = dataset.get_vocabulary()).T

When I sort the DataFrame by a topic's column to get the top words for that topic, why do the results differ from output["topics"][i]?

Thank you!
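
A minimal sketch of the check described above, using toy data rather than real model output. It assumes that each column of the topic-word matrix corresponds to the word at the same position in the vocabulary list. The names `top_words`, `model_order`, and the toy weights are hypothetical and introduced only for illustration. If the vocabulary passed in is ordered differently from the one the model used internally, the same weights get attached to the wrong words:

```python
def top_words(topic_word_matrix, vocabulary, topic, k=5):
    """Return the k highest-weighted words of one topic.

    Assumes column j of the matrix belongs to vocabulary[j].
    """
    weights = topic_word_matrix[topic]
    if len(weights) != len(vocabulary):
        raise ValueError("matrix width and vocabulary size differ")
    # Column indices sorted by descending weight.
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return [vocabulary[j] for j in ranked[:k]]


# Toy data (hypothetical): the word order the model used for its columns.
model_order = ["network", "bill", "signal", "call"]
matrix = [
    [0.10, 0.60, 0.05, 0.25],  # topic 0
    [0.50, 0.05, 0.40, 0.05],  # topic 1
]

# The same words, but sorted alphabetically like a vocabulary file.
alphabetical = sorted(model_order)

print(top_words(matrix, model_order, 0, k=2))   # ['bill', 'call']
print(top_words(matrix, alphabetical, 0, k=2))  # ['call', 'signal']
```

With real output, the first argument would be output["topic-word-matrix"] and the second dataset.get_vocabulary(); the result can then be compared against output["topics"][i].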


silviatti · 2023-02-02T09:20:40Z

There should be a one-to-one correspondence between the two. It's difficult to say what is wrong. Can you share more details about the problem?

Zay-Ben · 2023-02-02T12:11:51Z

Good day Dr. Silvia, nice to see you again, and thank you for your reply. Here are the details of the issue. :)

First, I created a dataset folder containing two files, corpus.txt and vocabulary.tsv, as OCTIS requires.

The corpus file:

The vocabulary file (sorted alphabetically):

Second, I loaded the dataset and trained LDA models with the dataset.

Third, after training, I imported one of the LDA models and built a data frame with the model’s topic-word-matrix as the data and the dataset’s vocabulary as the columns. The resulting data frame is shown in the figure below:

Last, the top 5 words of the data frame’s first topic are different from the top 5 words of the model’s first topic.

I can't determine why there are discrepancies in the top words of the topics.

With appreciation,

Benz

silviatti · 2023-04-15T13:49:34Z

Hi Benz,
sorry for the late reply. I haven't had time to work on OCTIS in recent months. There's something weird, I agree.
I would suggest two experiments in case you're still interested in this issue:

1. Can you also print out dataset.get_vocabulary()? Just to see if the vocabulary matches your file.
2. Could you try to repeat the experiment with another model and see if you have the same problem? I'd like to see whether the problem is specific to LDA or general.
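
The first experiment can be sketched as a small comparison. This assumes the vocabulary file holds one word per line and that dataset.get_vocabulary() returns a list of words. The helper name `compare_vocabularies` and the toy word lists are hypothetical:

```python
def compare_vocabularies(file_vocab, loaded_vocab):
    """Report whether two vocabularies share the same words and the same order."""
    first_difference = next(
        (i for i, (a, b) in enumerate(zip(file_vocab, loaded_vocab)) if a != b),
        None,  # no differing position found in the overlapping part
    )
    return {
        "same_words": set(file_vocab) == set(loaded_vocab),
        "same_order": list(file_vocab) == list(loaded_vocab),
        "first_difference": first_difference,
    }


# Toy stand-ins: in practice file_vocab would be read from the vocabulary
# file and loaded_vocab would come from dataset.get_vocabulary().
file_vocab = ["bill", "call", "network", "signal"]
loaded_vocab = ["network", "bill", "signal", "call"]

print(compare_vocabularies(file_vocab, loaded_vocab))
# {'same_words': True, 'same_order': False, 'first_difference': 0}
```

If the words are the same but the order is not, any data frame whose columns are labeled with one ordering while the matrix follows the other will attach weights to the wrong words.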

Thanks for your patience.

Silvia

Zay-Ben · 2023-04-15T15:34:14Z

Dear Dr. Silvia,

Thank you for taking the time to address my questions.

Regarding the first question, the results show that the order of the vocabulary differs before and after importing it with OCTIS. The vocabulary was sorted alphabetically before importing and appears to be randomly shuffled after importing, as shown in the image with the first five terms of each vocabulary.

Regarding the second question, I trained two models (ETM and NMF) using the same dataset and found that the problem persists for NMF, but not for ETM, as shown in the figure below. I noticed that OCTIS's LDA and NMF are both from Gensim. Could this be the source of the error?

ETM:

NMF:

Just to give context, the dataset consists of tweets that contain customer complaints about telecommunication companies.

Thank you again for your help! Topic modeling has never been easy without OCTIS. 😭

silviatti · 2023-05-03T07:31:13Z

Hi,
just to double-check, when you load the custom dataset, do you have a file in the dataset folder called vocabulary.txt? That should be the vocabulary file where words are sorted alphabetically. I ask because I noticed that your file is called "words.txt", so it is possible that OCTIS doesn't load it.

Let me know :)

Zay-Ben changed the title ~~May I ask if there is a mismatch between output["topic-word-matrix"] and dataset.get_vocabulary() in terms of words?~~ There is a mismatch between output["topic-word-matrix"] and dataset.get_vocabulary() in terms of words? Jan 11, 2023
