This project focuses on processing query and document texts using Python and the Natural Language Toolkit (nltk) library. The main goal is to prepare the texts for information retrieval tasks by performing several preprocessing steps, including tokenization, punctuation removal, conversion to lowercase, removal of stopwords, and stemming. The project includes two main scripts (lab3.py and lab3.5.py) that process document texts and query texts respectively.
- lab3.py: A Python script for processing document texts.
- lab3.5.py: A Python script for processing query texts.
- npl.tar.gz: A compressed file containing additional data required for the scripts.
This snippet reads the content of doc-text and tokenizes the text using the nltk library.
import os
import nltk
nltk.download('punkt')
# Path to the input file
input_file = r'C:\Users\mini_\OneDrive\Documentos\Code Test\TEST 1\lab3\npl\doc-text'
# Read the content of the external file
with open(input_file, 'r', encoding='utf-8') as archivo:
    texto = archivo.read()
# Tokenization using NLTK
palabras = nltk.word_tokenize(texto)
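The path above is specific to one machine. The following is a minimal sketch of the same read-and-tokenize step using a relative path. It assumes that npl.tar.gz was extracted in the repository root and that the file ends up at npl/doc-text. The helper names ensure_tokenizer_data and tokenize_file are hypothetical and are not taken from lab3.py. Recent NLTK releases look up a resource named punkt_tab for word_tokenize, so the sketch requests both resources.

```python
from pathlib import Path

import nltk


def ensure_tokenizer_data():
    """Fetch the Punkt tokenizer data; newer NLTK versions use 'punkt_tab'."""
    for resource in ('punkt', 'punkt_tab'):
        nltk.download(resource, quiet=True)


def tokenize_file(path):
    """Read a UTF-8 text file and return its NLTK word tokens."""
    texto = Path(path).read_text(encoding='utf-8')
    return nltk.word_tokenize(texto)


# Assumed location after `tar -xzf npl.tar.gz` in the repository root
input_file = Path('npl') / 'doc-text'

# Only run when the data file is actually present
if input_file.exists():
    ensure_tokenizer_data()
    palabras = tokenize_file(input_file)
    print(len(palabras), 'tokens')
```

Building the path with pathlib keeps the snippet usable on Windows, Linux, and macOS without editing the separator.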
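This snippet defines a regular expression to remove punctuation from the text read above (texto). The "|" character is replaced with a space, and every other punctuation character is deleted.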
import re
import string
# Define regular expression to remove punctuation
simbolos_extra = '’'
re_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))
# Replace "|" with space and remove other punctuation
stripped = re_punc.sub(lambda x: ' ' if x.group(0) == '|' else '', texto)
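The lowercase conversion and stopword removal steps are not shown as snippets. The following is a minimal sketch of how they could follow the punctuation step. The clean_text helper and the tiny STOPWORDS set are assumptions made for illustration. The scripts may use a different list, such as NLTK's stopwords corpus.

```python
import re
import string

# Same pattern as in the snippet above: ASCII punctuation plus the typographic apostrophe
simbolos_extra = '’'
re_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))

# Assumption: a tiny illustrative stopword set, not the list the scripts use
STOPWORDS = {'the', 'a', 'an', 'of', 'and', 'in', 'to', 'is', 'are'}


def clean_text(text, stopwords=STOPWORDS):
    """Strip punctuation, lowercase, and drop stopwords, keeping line breaks."""
    stripped = re_punc.sub(lambda m: ' ' if m.group(0) == '|' else '', text)
    cleaned_lines = []
    for line in stripped.lower().split('\n'):
        kept = [word for word in line.split() if word not in stopwords]
        cleaned_lines.append(' '.join(kept))
    return '\n'.join(cleaned_lines)


sample = "The Capacities of compact|memories,\nand the user’s QUERY."  # made-up input
filtered_words = clean_text(sample)
print(filtered_words)
```

The result is a single string with one cleaned line per input line, which is the shape the stemming step expects.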
This snippet uses PorterStemmer to stem the cleaned text. It expects filtered_words to hold that text as a single string produced by the earlier preprocessing steps.
from nltk.stem.porter import PorterStemmer
# Stemming with PorterStemmer
stemmer = PorterStemmer()
# Split content into lines
lines = filtered_words.split('\n')
# Stem words in each line
stemmed_lines = []
for line in lines:
    # Split line into words
    words = line.split()
    # Apply stemmer to each word
    stemmed_words = [stemmer.stem(word) for word in words]
    # Join stemmed words into a new line
    stemmed_line = ' '.join(stemmed_words)
    # Add stemmed line to list
    stemmed_lines.append(stemmed_line)
# Join stemmed lines into a single document
stemmed_content = '\n'.join(stemmed_lines)
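The same loop can be wrapped in a function and tried on a short input. This is a sketch only: stem_lines is a hypothetical name, and the sample text is made up.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def stem_lines(text):
    """Stem every whitespace-separated word, keeping the line structure."""
    return '\n'.join(
        ' '.join(stemmer.stem(word) for word in line.split())
        for line in text.split('\n')
    )


sample = 'running documents\nqueries'  # made-up input
stemmed_content = stem_lines(sample)
print(stemmed_content)
```

Porter stems are not always dictionary words. For example, "queries" becomes "queri", which is fine for retrieval as long as documents and queries go through the same stemmer.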
- Clone the repository to your local machine.
- Ensure you have Python and nltk installed.
- Extract the npl.tar.gz file to obtain the required data files.
- Run the lab3.py script to process document texts.
- Run the lab3.5.py script to process query texts.
git clone https://github.com/KPlanisphere/query-document-processing.git
cd query-document-processing
tar -xzf npl.tar.gz
python lab3.py
python lab3.5.py
- Python
- NLTK library