If you want to use this code, please cite our article describing this solution:
IEEE style
W. Brach, K. Košťál and M. Ries, "Can Large Language Model Detect Plagiarism in Source Code?," 2024 IEEE International Conference on Foundation and Large Language Models (FLLM2024), Dubai, United Arab Emirates, 2024, pp. 1-8.
This project aims to build a system for source code plagiarism detection using Large Language Models (LLMs) through the DSPy framework. Given two input code files, the system determines whether plagiarism has occurred and provides an explanation for the result.
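To make the task concrete, here is a minimal, standard-library-only sketch of the input/output contract described above: two code strings in, a verdict plus an explanation out. It is not the DSPy program from this repository. The names `Verdict`, `PROMPT_TEMPLATE`, `check_pair`, and `fake_llm` are hypothetical, and the prompt wording and the YES/NO reply format are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Result of comparing two files (hypothetical structure)."""
    plagiarized: bool
    explanation: str


# Assumed prompt layout; the actual prompts are produced by DSPy in the notebooks.
PROMPT_TEMPLATE = """Compare the two source files below.
Answer on the first line with YES or NO: is FILE B plagiarized from FILE A?
Explain your decision on the following lines.

--- FILE A ---
{code_a}

--- FILE B ---
{code_b}
"""


def check_pair(code_a: str, code_b: str, ask_llm: Callable[[str], str]) -> Verdict:
    """Build a prompt, send it through `ask_llm`, and parse the reply.

    `ask_llm` is any callable that maps a prompt string to a reply string,
    so a real model client can be plugged in without changing this function.
    """
    prompt = PROMPT_TEMPLATE.format(code_a=code_a, code_b=code_b)
    reply = ask_llm(prompt).strip()
    # First line carries the YES/NO decision, the rest is the explanation.
    first_line, _, rest = reply.partition("\n")
    return Verdict(
        plagiarized=first_line.strip().upper().startswith("YES"),
        explanation=rest.strip(),
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the sketch."""
    return "YES\nBoth files share the same structure; only identifiers differ."


if __name__ == "__main__":
    verdict = check_pair("int a = 1;", "int b = 1;", fake_llm)
    print(verdict.plagiarized, "-", verdict.explanation)
```

Passing the model call in as a function keeps the sketch runnable offline; in the notebooks this role is played by a compiled DSPy program.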
# Clone the repository
git clone https://github.com/fiit-ba/LLM-plagiarism-check.git
cd LLM-plagiarism-check
# Create a virtual environment
python3 -m venv llm-plagiarism-check
# Activate the virtual environment
source llm-plagiarism-check/bin/activate
# Install the required packages
pip install -r requirements.txt
Our project consists of several key components, each serving a specific purpose in our research workflow:
- check.ipynb: This is where we compile and train our DSPy programs.
- eval.ipynb: Use this notebook to evaluate the performance of our DSPy programs.
- jplag.ipynb: Run this to calculate the JPlag benchmark.
- analysis.ipynb: This notebook contains all our plots and analysis of results.
- dataloader.py: Provides support for loading our research data.
- models.py: Contains the model definitions for our DSPy programs.
- data/IR-Plag-Dataset/: This directory contains our plagiarism dataset, sourced from this GitHub repository.
- data/jplag/: Used for the JPlag benchmark calculations.
- data/metadata/: Stores metadata for our DSPy programs.
- data/results/: Where we save our research results.
- data/train.tsv: Our training dataset for DSPy.
- programs/: Contains DSPy programs.
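As an illustration of what evaluating a plagiarism detector involves, the sketch below computes standard binary classification metrics from ground-truth labels and predicted verdicts. It assumes each file pair is labeled simply as plagiarized or not. The function name `binary_metrics` is hypothetical, and the metrics actually reported in eval.ipynb may differ.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for plagiarized / not-plagiarized labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labeled pair is required")

    # Confusion-matrix counts, with "plagiarized" as the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    truth = [True, True, False, False, True]
    predicted = [True, False, False, True, True]
    for name, value in binary_metrics(truth, predicted).items():
        print(f"{name}: {value:.3f}")
```

The same function can be applied to the output of any detector, which makes it straightforward to compare LLM-based verdicts against a baseline such as JPlag on identical pairs.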
William Brach - @williambrach - william.brach@stuba.sk