Leverage the power of AI to enhance your projects, experiment with new ideas, and deepen your understanding of Large Language Models (LLMs) by fine-tuning a language model tailored to the NEAR ecosystem.
- Introduction
- Prerequisites
- Understanding Fine-Tuning
- Setup and Installation
- Configuring the Project
- Data Collection
- Data Processing
- Generating Refined Examples
- Preparing the Training Data
- Fine-Tuning the Model
- Using the Fine-Tuned Model
- Troubleshooting Common Errors
- Evaluating the Model
- Conclusion
- References
- Appendix: Understanding the Codebase
Welcome to this comprehensive guide on fine-tuning a language model specifically for the NEAR ecosystem. This tutorial is designed to educate developers who are eager to delve into the world of fine-tuning Large Language Models (LLMs). By the end of this guide, you'll have a deep understanding of the fine-tuning process, from data collection to deploying your customized model.
Before diving in, ensure you have the following:
- Programming Knowledge: Basic understanding of Python programming.
- Familiarity with LLMs: General knowledge of Large Language Models and their applications.
- Development Environment:
- A computer with internet connectivity.
- Permissions to install software.
- Python 3.8 or higher installed.
- Accounts and API Keys:
- OpenAI account with API access.
- GitHub account (optional, for accessing private repositories).
- Environment Setup:
- Familiarity with using virtual environments in Python.
Fine-tuning is the process of taking a pre-trained language model and further training it on a custom dataset to specialize it for specific tasks or domains. This allows the model to generate more relevant and accurate responses in the desired context.
- Domain Specificity: Tailor the model to understand and generate content related to the NEAR codebase.
- Improved Performance: Enhance the accuracy and relevance of responses for NEAR-related queries.
- Customization: Implement specific styles, terminologies, or formats required by your application.
OpenAI provides APIs to fine-tune models such as `gpt-4o`, enabling developers to customize models for their specific needs while leveraging the robust capabilities of large pre-trained models.
Start by cloning the repository containing the fine-tuning codebase:
git clone https://github.com/jbarnes850/near-fine-tuning-job.git
cd near-fine-tuning-job
Creating a virtual environment is a best practice to manage dependencies:
python -m venv near_env
source near_env/bin/activate # On Windows, use `near_env\Scripts\activate`
Install the required Python packages:
pip install -r requirements.txt
This will install all necessary libraries, including OpenAI's Python SDK, `tiktoken`, and others.
Create a `.env` file in the project root directory and add your API keys:
OPENAI_API_KEY=your_openai_api_key
GITHUB_API_KEY=your_github_api_key # Required if accessing private repositories
Alternatively, you can export them in your shell:
export OPENAI_API_KEY='your_openai_api_key'
export GITHUB_API_KEY='your_github_api_key'
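Either way, the code picks the keys up from the environment at runtime. Below is a minimal sketch of that pattern, assuming the `python-dotenv` package is used to read the `.env` file (the repository's actual loading code may differ):

import os
from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # reads KEY=value pairs from .env into the process environment

openai_api_key = os.getenv("OPENAI_API_KEY")
github_api_key = os.getenv("GITHUB_API_KEY")  # may be None if not set

if openai_api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; see the Configuration section.")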
Modify `config.yaml` to customize the fine-tuning process. Key sections include:
github:
repos:
- "near/docs"
- "near/neps"
# Add more repositories as needed
articles:
urls:
- "https://near.org/blog/near-protocol-economics/"
- "https://near.org/blog/understanding-nears-nightshade-sharding-design/"
# Add more articles as needed
openai:
model: "gpt-4o-mini-2024-07-18" # Ensure this is a model that supports fine-tuning
temperature: 0.7
max_tokens: 1000
system_prompt: "You are an AI assistant specializing in NEAR Protocol and blockchain technology."
fine_tuning:
model: "gpt-4o-2024-08-06"
n_epochs: 4
suffix: "NEAR_Ecosystem_Model"
monitoring_interval: 60 # In seconds
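For context, these settings are read back into Python as a nested dictionary. A minimal sketch of how a loader such as the one in `fine_tuning/config.py` might work, assuming the standard PyYAML package (the repository's actual loader may differ):

import yaml  # provided by the PyYAML package

def load_config(path="config.yaml"):
    """Read the YAML configuration into a nested dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_config()
print(config["fine_tuning"]["model"])  # "gpt-4o-2024-08-06"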
The first step in fine-tuning is collecting relevant data.
We fetch code and documentation from specified GitHub repositories.
- File: `fine_tuning/data_fetchers.py`
- Function: `fetch_repo_data`
def fetch_repo_data(self, repo_name):
"""Fetch data from a GitHub repository."""
# Check if cached data exists
if self.use_cache and self.is_data_cached(repo_name, is_repo=True):
logging.info(f"Using cached data for repository: {repo_name}")
data = self.load_cached_data(repo_name, is_repo=True)
else:
logging.info(f"Fetching repository data: {repo_name}")
# Fetch data from GitHub
data = self._fetch_repo_contents(repo_name)
self.cache_data(repo_name, data, is_repo=True)
return data
We scrape content from specified web articles.
- File: `fine_tuning/data_fetchers.py`
- Function: `fetch_article_data`
def fetch_article_data(self, url):
"""Fetch data from a web article."""
# Check if cached data exists
if self.use_cache and self.is_data_cached(url, is_repo=False):
logging.info(f"Using cached data for article: {url}")
data = self.load_cached_data(url, is_repo=False)
else:
logging.info(f"Fetching article data from: {url}")
# Fetch data from the web
data = self._fetch_web_content(url)
self.cache_data(url, data, is_repo=False)
return data
After collecting data, we need to process it into a suitable format for fine-tuning.
We split code files into manageable chunks.
- File: `fine_tuning/data_processors.py`
- Function: `process_repo_data`
def process_repo_data(self, all_repo_data):
"""Process data from multiple repositories into prompts."""
processed_data = []
for repo_name, repo_files in all_repo_data.items():
for file_path, content in repo_files.items():
# Skip binary files or files that are too large
if self.is_binary_file(content) or self.is_large_file(content):
continue
splits = self.split_content(content, self.config['openai']['max_tokens'])
for split_content in splits:
prompt = f"Explain the following code snippet from `{file_path}` in the `{repo_name}` repository:\n```{split_content}```"
processed_data.append({'prompt': prompt})
return processed_data
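The `is_binary_file` and `is_large_file` helpers are referenced above but not shown. A plausible sketch of such checks, with hypothetical thresholds (the repository's actual heuristics may differ):

def is_binary_file(self, content):
    """Heuristic check: content containing NUL bytes is treated as binary."""
    if isinstance(content, bytes):
        return b"\x00" in content
    return "\x00" in content

def is_large_file(self, content, max_chars=100_000):
    """Skip files above a size threshold (hypothetical 100k-character cutoff)."""
    return len(content) > max_chars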
We split articles into sections.
- File: `fine_tuning/data_processors.py`
- Function: `process_article_data`
def process_article_data(self, all_article_data):
"""Process data from multiple articles into prompts."""
processed_data = []
for url, content in all_article_data.items():
splits = self.split_content(content, self.config['openai']['max_tokens'])
for split_content in splits:
prompt = f"Summarize the following section from the article at {url}:\n{split_content}"
processed_data.append({'prompt': prompt})
return processed_data
We ensure that content chunks do not exceed token limits.
def split_content(self, content, max_tokens):
"""Split content into chunks based on token limits."""
encoding = get_encoding('cl100k_base')
tokens = encoding.encode(content)
splits = []
start = 0
while start < len(tokens):
end = start + max_tokens
split_tokens = tokens[start:end]
split_content = encoding.decode(split_tokens)
splits.append(split_content)
start = end
return splits
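To see the chunking in isolation, you can run the same tokenizer directly (`cl100k_base` is the encoding used above):

from tiktoken import get_encoding

encoding = get_encoding("cl100k_base")
text = "NEAR uses Nightshade sharding. " * 200
tokens = encoding.encode(text)
print(len(tokens))  # total token count for the text

# Reproduce the chunking logic with a 100-token limit
chunks = [encoding.decode(tokens[i:i + 100]) for i in range(0, len(tokens), 100)]
print(len(chunks), "chunks")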
We use an OpenAI model to generate refined examples.
- Purpose: Create high-quality question-answer pairs for fine-tuning.
- Process:
- Use prompts from processed data.
- Generate assistant responses.
- File: `fine_tuning/data_processors.py`
- Function: `generate_refined_examples`
def generate_refined_examples(self, processed_data):
"""Generate assistant responses for each prompt using OpenAI API."""
refined_examples = []
for data in tqdm(processed_data, desc="Generating refined examples"):
messages = [
{"role": "system", "content": self.config['openai']['system_prompt']},
{"role": "user", "content": data['prompt']}
]
try:
response = openai.ChatCompletion.create(
model=self.config['openai']['model'],
messages=messages,
temperature=self.config['openai']['temperature'],
max_tokens=self.config['openai']['max_tokens']
)
assistant_message = response.choices[0].message.content
refined_examples.append({
"messages": [
{"role": "user", "content": data['prompt']},
{"role": "assistant", "content": assistant_message}
]
})
except Exception as e:
logging.error(f"Failed to generate response for prompt: {data['prompt']}\nError: {e}")
return refined_examples
Note: Ensure that you have correctly imported and initialized the `openai` module:
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
We prepare the data in the format required by OpenAI's fine-tuning API.
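Each line of the training file is one JSON object in OpenAI's chat fine-tuning format. An illustrative example (the file path and content shown here are placeholders):

{"messages": [{"role": "user", "content": "Explain the following code snippet from `path/to/file.md` in the `near/docs` repository: ..."}, {"role": "assistant", "content": "This snippet explains ..."}]}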
Ensure each example meets the required format.
def validate_example(self, example):
"""Validate the structure of a training example."""
required_keys = {'messages'}
if not isinstance(example, dict):
return False
if not required_keys.issubset(example.keys()):
return False
if not isinstance(example['messages'], list):
return False
for message in example['messages']:
if 'role' not in message or 'content' not in message:
return False
if message['role'] not in ['system', 'user', 'assistant']:
return False
if not isinstance(message['content'], str) or not message['content'].strip():
return False
return True
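In practice, the validator can filter the generated examples before they are saved. A minimal, hypothetical call site:

# Hypothetical call site: drop malformed examples before writing the training file
valid_examples = [ex for ex in refined_examples if self.validate_example(ex)]
logging.info(f"{len(valid_examples)} of {len(refined_examples)} examples passed validation")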
We save the validated examples in a `.jsonl` file.
def save_as_jsonl(self, data, output_file="fine_tuning_data.jsonl"):
"""Save data to a JSONL file with UTF-8 encoding and proper escaping."""
with open(output_file, 'w', encoding='utf-8') as f:
for item in data:
json_line = json.dumps(item, ensure_ascii=False)
f.write(json_line + '\n')
logging.info(f"Fine-tuning data saved to {output_file}")
We upload the training data and start the fine-tuning job.
- File: `fine_tuning/fine_tuning.py`
- Function: `upload_training_file`
def upload_training_file(self, file_path):
"""Upload the training file to OpenAI with validation."""
with open(file_path, 'rb') as f:
response = openai.File.create(
file=f,
purpose='fine-tune'
)
file_id = response['id']
logging.info(f"Training file uploaded successfully. File ID: {file_id}")
return file_id
- File: `fine_tuning/fine_tuning.py`
- Function: `create_fine_tune_job`
def create_fine_tune_job(self, training_file_id):
    """Create a fine-tuning job in OpenAI."""
    # Chat models such as gpt-4o are fine-tuned via the fine-tuning jobs endpoint,
    # not the legacy FineTune endpoint
    response = openai.FineTuningJob.create(
        training_file=training_file_id,
        model=self.config['fine_tuning']['model'],
        hyperparameters={"n_epochs": self.config['fine_tuning']['n_epochs']},
        suffix=self.config['fine_tuning'].get('suffix', '')
    )
job_id = response['id']
logging.info(f"Fine-tuning job created successfully. Job ID: {job_id}")
return job_id
- File: `fine_tuning/fine_tuning.py`
- Function: `monitor_fine_tune_job`
def monitor_fine_tune_job(self, job_id):
"""Monitor the fine-tuning job until completion."""
logging.info(f"Monitoring fine-tuning job: {job_id}")
while True:
try:
            response = openai.FineTuningJob.retrieve(job_id)
status = response['status']
logging.info(f"Job status: {status}")
if status == 'succeeded':
model_id = response['fine_tuned_model']
logging.info(f"Fine-tuning succeeded. Fine-tuned model ID: {model_id}")
return model_id
elif status in ['failed', 'cancelled']:
                error_message = response.get('error') or 'No details provided.'
logging.error(f"Fine-tuning {status}. Reason: {error_message}")
return None
else:
time.sleep(self.config['fine_tuning'].get('monitoring_interval', 60))
except openai.error.OpenAIError as e:
logging.error(f"Error while checking fine-tuning job status: {e}")
time.sleep(self.config['fine_tuning'].get('monitoring_interval', 60))
- File: `fine_tuning/main.py`
if __name__ == "__main__":
main()
Run the script:
python -m fine_tuning.main
Once fine-tuning is complete and you have your `model_id`, you can use the fine-tuned model:
import openai
openai.api_key = 'your_openai_api_key'
response = openai.ChatCompletion.create(
model='your_fine_tuned_model_id',
messages=[
{"role": "system", "content": "You are a NEAR Protocol expert."},
{"role": "user", "content": "Explain NEAR's sharding mechanism."}
]
)
print(response.choices[0].message.content)
Fine-tuning language models can present various challenges. Below are some common errors you might encounter during the fine-tuning process and their solutions.
Error Message: AuthenticationError: Incorrect API key provided
Cause: This error occurs when the OpenAI API key is missing, incorrect, or improperly configured.
Solution:
- Check API Key: Ensure that your `OPENAI_API_KEY` is correctly set in your environment variables or in the `.env` file.
- Correct Usage: If you export the key in your shell, make sure no stray quotes or whitespace are included in the value.
- Update Configuration: Make sure the API key is properly loaded in your script (e.g., using `os.getenv("OPENAI_API_KEY")`).
Error Message: RateLimitError: You exceeded your current quota, please check your plan and billing details.
Cause: This indicates you've exceeded your allocated usage quota for the OpenAI API.
Solution:
- Check Usage Dashboard: Visit the OpenAI Usage Dashboard to monitor your usage.
- Upgrade Plan: Consider upgrading your subscription plan for higher quotas.
- Optimize Requests: Reduce the number of API calls or optimize your code to make efficient use of the API.
Error Message: InvalidRequestError: This model does not support fine-tuning.
Cause: Attempting to fine-tune a model that doesn't support fine-tuning.
Solution:
- Supported Models: Ensure you're using a model that supports fine-tuning, such as `gpt-4o-2024-08-06`, `gpt-4o-mini-2024-07-18`, or `gpt-3.5-turbo`.
- Update Configuration: Modify the `model` parameter in your `config.yaml` and code to use a fine-tune-compatible model.
Error Message: InvalidRequestError: The file is not formatted correctly.
Cause: The training file is not in the required JSONL format or contains invalid data.
Solution:
- Validate JSONL File: Check the training file for proper JSON Lines formatting (see the sketch after this list).
- Use Validators: Utilize JSONL validators or linters to detect issues in the file.
- Correct Data Structure: Ensure each line in the file is a valid JSON object with the required fields.
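A quick self-check that every line parses as JSON and carries the expected key, assuming the `fine_tuning_data.jsonl` filename used earlier:

import json

with open("fine_tuning_data.jsonl", "r", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError as e:
            print(f"Line {line_number} is not valid JSON: {e}")
            continue
        if "messages" not in example:
            print(f"Line {line_number} is missing the 'messages' key")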
Error Message: APIConnectionError: Error communicating with OpenAI
Cause: Network connectivity issues between your environment and the OpenAI API servers.
Solution:
- Check Internet Connection: Ensure your network connection is stable.
- Retry Logic: Implement retry logic with exponential backoff in your API calls (see the sketch after this list).
- Firewall Settings: Verify that your firewall or proxy settings are not blocking API requests.
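A minimal retry-with-exponential-backoff sketch, written for the pre-1.0 `openai` SDK style used throughout this guide (the function name is illustrative; adjust for your SDK version):

import time
import logging
import openai

def chat_with_retry(messages, model, max_retries=5):
    """Call the chat API, retrying transient failures with exponential backoff."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(model=model, messages=messages)
        except openai.error.OpenAIError as e:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            logging.warning(f"Attempt {attempt + 1} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2  # double the wait between attempts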
After fine-tuning, it's essential to evaluate your model to ensure it meets your performance expectations.
Assess the model's ability to answer questions related to the NEAR Protocol.
Example:
import openai
openai.api_key = 'your_openai_api_key'
def ask_near_question(question):
response = openai.ChatCompletion.create(
model='your_fine_tuned_model_id',
messages=[
{"role": "user", "content": question}
],
temperature=0.5
)
return response.choices[0].message.content
# Test the model
question = "How does NEAR's consensus mechanism work?"
answer = ask_near_question(question)
print(f"Q: {question}\nA: {answer}")
Consider quantitative metrics to evaluate your model:
- Response Accuracy: Manually review responses for correctness.
- Relevance Score: Rate how relevant the responses are to the questions asked.
- Completeness: Check if the model provides comprehensive answers.
Gather feedback from end-users or testers:
- Surveys and Questionnaires: Collect user opinions on the model's performance.
- Error Reporting: Encourage reporting of any incorrect or unsatisfactory responses.
Compare the fine-tuned model against the base model:
- Baseline Comparison: Use the base model (e.g., `gpt-4o-2024-08-06`) to answer the same set of questions, as shown in the sketch below.
- Evaluate Improvements: Identify areas where the fine-tuned model performs better.
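A simple, illustrative harness for running the same questions through the base and fine-tuned models for side-by-side review; replace the model IDs with your own:

import openai

openai.api_key = 'your_openai_api_key'

def ask(model_id, question):
    """Ask one question of a given model."""
    response = openai.ChatCompletion.create(
        model=model_id,
        messages=[{"role": "user", "content": question}],
        temperature=0.5
    )
    return response.choices[0].message.content

questions = [
    "How does NEAR's consensus mechanism work?",
    "What is Nightshade sharding?",
]
for q in questions:
    base = ask('gpt-4o-2024-08-06', q)          # base model
    tuned = ask('your_fine_tuned_model_id', q)  # your fine-tuned model ID
    print(f"Q: {q}\nBase: {base}\nFine-tuned: {tuned}\n")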
Congratulations on fine-tuning your custom NEAR Protocol language model! This tailored model should provide more accurate and relevant responses for NEAR-related queries, enhancing your applications and user experience.
Key Takeaways:
- Fine-tuning allows you to specialize a general-purpose model for specific domains.
- Proper data collection and processing are critical for effective fine-tuning.
- Always evaluate and iterate on your model to maintain and improve performance.
- OpenAI Fine-Tuning Documentation: https://platform.openai.com/docs/guides/fine-tuning
- NEAR Official Site: https://near.org
- NEAR Documentation: https://docs.near.org
- GitHub API Documentation: https://docs.github.com/en/rest
For a deeper understanding of the project's structure and components, review the following sections.
near-fine-tuned-model/
├── fine_tuning/
│ ├── __init__.py
│ ├── data_fetchers.py
│ ├── api_clients.py
│ ├── config.py
│ ├── data_processors.py
│ ├── fine_tuning.py
│ ├── main.py
│ ├── utils.py
├── tests/
│ ├── __init__.py
│ ├── test_file_upload.py
│ ├── test_fine_tuning_creation.py
│ ├── test_job_monitoring.py
│ └── test_model_evaluation.py
├── config.yaml
├── config_template.yaml
├── requirements.txt
├── README.md
├── model_card.md
├── LICENSE
└── .env.example
- data_fetchers.py: Retrieves data from GitHub repositories and web articles.
- data_processors.py: Processes and cleans the fetched data, prepares prompts.
- fine_tuning.py: Handles interactions with the OpenAI API for file upload and fine-tuning jobs.
- main.py: The primary script that orchestrates data fetching, processing, and fine-tuning.
The `config.yaml` file contains all configurable parameters:
- GitHub Repositories: Specify repositories to fetch data from.
- Articles: List of article URLs to include in the dataset.
- OpenAI Settings: Model choice, temperature, max tokens, and system prompts.
- Fine-Tuning Parameters: Epochs, model suffix, and monitoring intervals.
Store sensitive information like API keys:
OPENAI_API_KEY=your_openai_api_key
GITHUB_API_KEY=your_github_api_key
Security Tip: Never commit the `.env` file to version control systems.
- NEAR Developer Community: Engage with other developers in the NEAR Discord Channel.
- OpenAI Community Forum: Discuss and seek assistance at the OpenAI Community.
This guide is intended to empower developers to harness the capabilities of fine-tuned language models within the NEAR ecosystem. Continue exploring and innovating!