I read in my own data and save it as a .txt file with one document per line. Then I define the preprocessing and run it via preprocessor.preprocess_dataset. This fails with AttributeError: 'list' object has no attribute 'lower'. If I do not set num_processes, everything works.
The loop in simple_preprocessing_steps, combined with process_map, breaks the documents into individual letters. See below.
import os
import string
from octis.preprocessing.preprocessing import Preprocessing
import pandas as pd
docs = pd.read_csv('tweets.csv',lineterminator='\n')
docs['clean_tweets'].to_csv('documents.txt', header=None, sep='\n', mode='w', encoding="utf-8")
preprocessor = Preprocessing(
    max_features=None,
    remove_punctuation=True, punctuation=string.punctuation,
    lemmatize=True, stopword_list='german',
    min_chars=1, min_words_docs=0, language='german', split=False,
    num_processes=36, max_df=0.9, min_df=0.05)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path='documents.txt')
Traceback (most recent call last):
  File "/home/p/p_drec01/lda/preprocess_lda_test.py", line 40, in <module>
    dataset = preprocessor.preprocess_dataset(documents_path='/scratch/tmp/p_drec01/lda/octis_data/documents.txt')
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/octis/preprocessing/preprocessing.py", line 171, in preprocess_dataset
    vocabulary = self.filter_words(docs)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/octis/preprocessing/preprocessing.py", line 290, in filter_words
    vectorizer.fit_transform(docs)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1846, in fit_transform
    X = super().fit_transform(raw_documents)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1202, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1114, in _count_vocab
    for feature in analyze(doc):
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 104, in _analyze
    doc = preprocessor(doc)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 69, in _preprocess
    doc = doc.lower()
AttributeError: 'list' object has no attribute 'lower'
##############
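from tqdm.contrib.concurrent import process_map

documents_path = 'documents.txt'
docs2 = [line.strip() for line in open(documents_path, 'r').readlines()]

def simple_preprocessing_steps(docs):
    tmp_docs = []
    # process_map passes one document (a string) per call, so this loop
    # iterates over the characters of that document
    for d in docs:
        print(d)

docs2 = process_map(simple_preprocessing_steps, docs2, max_workers=16, chunksize=1)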
Ü
b
e
r
6
"
U
M
n
etc.
I'm getting the same issue. It only seems to occur when, using Preprocessing, num_processes is not None or split=True. It seems these code paths transform a list of strings (e.g., ['dog', 'cat']) into a list of lists of characters (e.g., [['d', 'o', 'g'], ['c', 'a', 't']]).
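To make the suspected mechanism concrete, here is a minimal sketch that uses only the standard library. It assumes the root cause is a function written for a list of documents being mapped over individual documents. The built-in map stands in for process_map, and simple_steps and chunked are hypothetical names, not OCTIS functions. The batching workaround at the end is one possible approach, not necessarily how the library should be fixed.

```python
def simple_steps(docs):
    # Hypothetical stand-in for a preprocessing function that expects
    # a LIST of documents and returns a list of processed documents.
    return [d.lower() for d in docs]


def chunked(seq, size):
    # Hypothetical helper: split a list into consecutive batches.
    return [seq[i:i + size] for i in range(0, len(seq), size)]


docs = ["Dog", "Cat"]

# Mapping over the documents passes ONE string per call, so the loop
# inside simple_steps iterates over the characters of that string.
broken = list(map(simple_steps, docs))

# A later step that expects strings then fails on the nested lists.
try:
    broken[0].lower()
except AttributeError as exc:
    error_message = str(exc)

# Possible workaround: map over batches (lists of documents) and
# flatten the per-batch results back into one list of strings.
batches = chunked(docs, 1)
fixed = [doc for batch in map(simple_steps, batches) for doc in batch]
```

Here broken is a list of character lists while fixed is again a flat list of strings, which is the shape a vectorizer that calls doc.lower() needs.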
OCTIS version: 1.11.0
Python version: 3.8.15
Operating System: 'posix'