AttributeError: 'list' object has no attribute 'lower' preprocessor.preprocess_dataset when num_processes != None #99

p-dre · 2023-03-06T10:08:16Z

OCTIS version: 1.11.0
Python version: 3.8.15
Operating System: 'posix'

Description - What I Did

I read in my own data and saved it as a .txt file with one document per line. I then defined the preprocessing and ran it via preprocessor.preprocess_dataset. This fails with AttributeError: 'list' object has no attribute 'lower'. If I do not set num_processes, everything works.

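The loop in simple_preprocessing_steps, in combination with process_map, breaks the documents into single characters: process_map passes each document string to the function separately, so the loop iterates over the characters of that string. See the script, traceback, and minimal reproduction below.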


import os
import string
from octis.preprocessing.preprocessing import Preprocessing
import pandas as pd

docs = pd.read_csv('tweets.csv', lineterminator='\n')
docs['clean_tweets'].to_csv('documents.txt', header=None, sep='\n', mode='w', encoding="utf-8")

preprocessor = Preprocessing(max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='german',
                             min_chars=1, min_words_docs=0, language='german', split=False,
                             num_processes=36, max_df=0.9, min_df=0.05)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path='documents.txt')

Traceback (most recent call last):
  File "/home/p/p_drec01/lda/preprocess_lda_test.py", line 40, in <module>
    dataset = preprocessor.preprocess_dataset(documents_path='/scratch/tmp/p_drec01/lda/octis_data/documents.txt')
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/octis/preprocessing/preprocessing.py", line 171, in preprocess_dataset
    vocabulary = self.filter_words(docs)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/octis/preprocessing/preprocessing.py", line 290, in filter_words
    vectorizer.fit_transform(docs)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1846, in fit_transform
    X = super().fit_transform(raw_documents)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1202, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1114, in _count_vocab
    for feature in analyze(doc):
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 104, in _analyze
    doc = preprocessor(doc)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 69, in _preprocess
    doc = doc.lower()
AttributeError: 'list' object has no attribute 'lower'





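##############
# Minimal reproduction: process_map calls the function once per document,
# so `docs` is a single string and the loop prints one character at a time.
from tqdm.contrib.concurrent import process_map

documents_path = 'documents.txt'
with open(documents_path, 'r') as f:
    docs2 = [line.strip() for line in f]

def simple_preprocessing_steps(docs):
    for d in docs:
        print(d)

docs2 = process_map(simple_preprocessing_steps, docs2, max_workers=16, chunksize=1)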

Ü
b
e
r

6
"

U
M
n


etc.
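
The single-character output is consistent with process_map calling the function once per element of the iterable, in the same way Executor.map does. Each call receives one document string as `docs`, and the inner loop then walks over its characters. Below is a minimal sketch of that behaviour. It uses the standard library's ThreadPoolExecutor.map as a stand-in for process_map (an assumption made to keep the example dependency-free), and the names `per_list_steps` and `per_document_steps` are hypothetical, not part of OCTIS:

```python
from concurrent.futures import ThreadPoolExecutor


def per_list_steps(docs):
    """Hypothetical: written for a list of documents, loops over its input."""
    return [d for d in docs]


def per_document_steps(doc):
    """Hypothetical: written for a single document string, no inner loop."""
    return doc.lower()


documents = ["Über 6", "UMn"]

# ThreadPoolExecutor.map stands in for process_map here (assumption).
with ThreadPoolExecutor(max_workers=2) as pool:
    # map() hands each element (one string) to the function separately,
    # so the inner loop iterates over characters instead of documents.
    broken = list(pool.map(per_list_steps, documents))
    # A function that expects one document keeps each string intact.
    intact = list(pool.map(per_document_steps, documents))

print(broken)  # a list of lists of single characters
print(intact)  # a list of strings
```

Whether the function or the call site should change is a design choice. The sketch only shows that the function's expected input (a list versus a single document) has to match what the mapping call passes to it.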


Edilson-R · 2023-05-29T14:49:11Z

I have the same problem when loading a custom dataset.

Python 3.10.11
OCTIS 1.12.1
System: Windows 10

Code:
import os
import string
import spacy
from octis.preprocessing.preprocessing import Preprocessing

preprocessor = Preprocessing(lowercase=True, vocabulary=None, max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, language='portuguese', remove_numbers=True,
                             min_chars=4, remove_stopwords_spacy=True, min_df=0.1, max_df=0.8,
                             num_processes=7)

AttributeError: 'list' object has no attribute 'lower'

vinnyricciardi · 2023-06-21T18:14:13Z

I'm getting the same issue. It only seems to occur when Preprocessing is used with num_processes set to something other than None, or with split=True. It seems these code paths transform a list of strings (e.g., ['dog', 'cat']) into a list of lists of single characters (e.g., [['d', 'o', 'g'], ['c', 'a', 't']]).
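
To illustrate why this change of shape surfaces as the AttributeError in the traceback: the failing line there is `doc = doc.lower()`, which works on a string but not on a list. The sketch below reproduces that with a hypothetical stand-in `lowercase_all` (not scikit-learn's actual code), plus a hypothetical `is_list_of_strings` check that could be run on the documents before vectorizing:

```python
def lowercase_all(docs):
    """Hypothetical stand-in for a step that lowercases every document."""
    return [doc.lower() for doc in docs]


def is_list_of_strings(docs):
    """Return True if every document is a plain string."""
    return all(isinstance(doc, str) for doc in docs)


flat = ["dog", "cat"]
nested = [list(doc) for doc in flat]  # [['d', 'o', 'g'], ['c', 'a', 't']]

print(is_list_of_strings(flat))    # True
print(is_list_of_strings(nested))  # False

try:
    lowercase_all(nested)
except AttributeError as err:
    # Each "document" is a list, and lists have no .lower() method.
    message = str(err)
    print(message)
```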

silviatti mentioned this issue Jun 23, 2023

Fix preprocessing with num_processes argument #96

Merged
