Commit
move redundant data preprocess files
ruiyiw committed Nov 6, 2023
2 parents 652e7b0 + 9abef83 commit ce6344e
Showing 9 changed files with 538 additions and 2 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -6,4 +6,5 @@ We split our overall framework into multiple parts
2. Together AI Finetuning --> Input the train and test data / Output model checkpoint
3. LLM Finetuning --> Input the train and test data / Output model checkpoint
4. LLM Deployment --> Input LLM Finetuned model checkpoint / Output Deployable OpenAI type API
5. Eval --> Input model checkpoint / Output evaluation scores
6. Generate --> Input None / Output new data on Redis
170 changes: 169 additions & 1 deletion llm_deploy/README.md
@@ -1,5 +1,173 @@
## Deploy a LoRA-finetuned model using a vLLM variant

We need to use an unmerged branch to support deploying a LoRA-finetuned model (the forked repo is https://github.com/troph-team/vllm.git).

Go to the vllm directory and install it in editable mode.
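A minimal sketch, assuming you are installing from the fork named above:
```bash
git clone https://github.com/troph-team/vllm.git
cd vllm
pip install -e .  # editable install of the forked vllm
```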

Note https://github.com/vllm-project/vllm/issues/1283: if you face a CUDA version error, modify the config file to pin PyTorch to "== 2.0.1" and install the matching PyTorch version.
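A hedged sketch of the pin (the exact config file to edit depends on the fork's build setup); installing the pinned PyTorch before rebuilding may resolve the mismatch:
```bash
pip install "torch==2.0.1"  # pin PyTorch per vllm issue #1283
pip install -e .            # then rebuild vllm against the pinned version
```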


## Setting up Babel server
### Login with SSH key
Add your public ed25519 key to the server
```bash
ssh-copy-id -i ~/.ssh/id_ed25519.pub <username>@<mycluster>
```
Configure your SSH config file (`~/.ssh/config`)
```bash
Host <mycluster>
HostName <mycluster>
User <username>
IdentityFile ~/.ssh/id_ed25519
```
Log in to Babel with the SSH key
```bash
ssh <mycluster>
```

### Connecting to a compute node
Jump from the login node to a compute node
```bash
srun --pty bash
```
Check that you can access the `/data` folder
```bash
cd /data/datasets/
```

### Configure the environment on the compute node
Install miniconda
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda init
conda create --name myenv
conda activate myenv
# conda deactivate
```
Install vllm packages
```bash
conda install pip
pip install vllm
```
Install fastchat packages
```bash
conda install pip
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install --upgrade pip
pip3 install "fschat[model_worker,webui]"
```
Submit a GPU request and open an interactive terminal
```bash
srun --gres=gpu:1 --time=1-00:00:00 --mem=80G --pty $SHELL
conda activate myenv
```
Some useful commands for checking gpu jobs
```bash
# check slurm status
squeue -l
# check gpu status
nvidia-smi
# check gpu usage
pip install gpustat
watch -n 1 gpustat
# quit slurm jobs
scancel job_id
# connect to compute node directly
ssh -J babel babel-x-xx
```

### Install cuda-toolkit (optional)
Due to the vllm issue https://github.com/vllm-project/vllm/issues/1283, we need cuda-toolkit 11.7.0, which is compatible with PyTorch 2.0.1.
Install cuda-toolkit 11.7.0 in the conda environment
```bash
conda install -c "nvidia/label/cuda-11.7.0" cuda-toolkit
```
Check cuda-toolkit version
```bash
nvcc -V
```
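To confirm the toolchain lines up, a quick check of the installed PyTorch build and the CUDA version it was compiled against:
```bash
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```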

## Deploy models on Babel via FastChat API server
Run the following commands in three separate interactive terminal windows:
```bash
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path model-checkpoint
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
Call model checkpoint API
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model-checkpoint",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
*Sample output:*
```JSON
{"id":"cmpl-GGvKBiZFdFLzPq2HdtuxbC","object":"text_completion","created":1698692212,"model":"checkpoint-4525","choices":[{"index":0,"text":"city that is known for its icon","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"total_tokens":11,"completion_tokens":6}}
```

## Deploy models on Babel via vllm API server
Start the vLLM server with the model checkpoint
```bash
python -m vllm.entrypoints.openai.api_server --model model_checkpoint/
```
List the models served by the API
```bash
curl http://localhost:8000/v1/models
```
*Sample output:*
```JSON
{"object":"list","data":[{"id":"Mistral-7B-Instruct-v0.1/","object":"model","created":1697599903,"owned_by":"vllm","root":"Mistral-7B-Instruct-v0.1/","parent":null,"permission":[{"id":"modelperm-d415ecf6362a4f818090eb6428e0cac9","object":"model_permission","created":1697599903,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
```
Query the completions API with the model checkpoint
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model_checkpoint",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
*Sample output:*
```JSON
{"id":"cmpl-bf7552957a8a4bd89186051c40c52de4","object":"text_completion","created":3600699,"model":"Mistral-7B-Instruct-v0.1/","choices":[{"index":0,"text":" city that is known for its icon","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
```

## Access deployed Babel server on a local machine
Construct an SSH tunnel between the Babel login node and the compute node hosting the model
```bash
ssh -N -L 7662:localhost:8000 username@babel-x-xx
```
The above command creates a localhost:7662 server on the Babel login node which forwards to localhost:8000 on the compute node.

Construct an SSH tunnel between your local machine and the Babel login node
```bash
ssh -N -L 8001:localhost:7662 username@<mycluster>
```
The above command creates a localhost:8001 server on your local machine which forwards to localhost:7662 on the Babel login node.

Call the hosted model from your local machine
```bash
curl http://localhost:8001/v1/models
```
If the above command runs successfully, you should be able to use the REST API from your local machine.
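As a further check, you can run a completion through the tunnel, mirroring the earlier examples (`model_checkpoint` stands for whatever model name `/v1/models` reports):
```bash
curl http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "model_checkpoint",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
```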

(Optional) If building the SSH tunnel fails, add `-v` to the ssh command to see what went wrong.

## Useful resource links for Babel
1. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=BABEL#Cluster_Architecture
2. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=VSCode
3. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=Training_Material
4. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=Connecting_to_the_Cluster#Copying_Data_to_Compute_Nodes

9 changes: 9 additions & 0 deletions llm_generate/README.md
@@ -0,0 +1,9 @@
# Data Generation

For the first step, we generate envProfile (including scenario / social goal / relationship restriction) based on inspiring prompts.

For the second step, we put the original agentProfile and relationshipProfile into our new redis database.

For the third step, we combine them into combos based on conditional sampling (the restriction is the relationship); see the sketch after this section.

All the EnvProfile (newly generated), AgentProfile (sotopia original), RelationshipProfile (sotopia original), and envagentcombo records live on the newly created redis database.
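A minimal sketch of the conditional sampling in the third step, assuming sotopia's redis-backed profile models (the `find().all()` pattern follows `step1_generate_env_profile.py` below; the field names and the combo layout are illustrative, not the exact pipeline code):
```python
import random

from sotopia.database import (
    AgentProfile,
    EnvironmentProfile,
    RelationshipProfile,
)

# pick one newly generated environment; its relationship field is the sampling restriction
env = random.choice(EnvironmentProfile.find().all())

# keep only agent pairs whose relationship matches the environment's restriction
candidate_pairs = [
    (rel.agent_1_id, rel.agent_2_id)
    for rel in RelationshipProfile.find().all()
    if rel.relationship == env.relationship
]

# sample one eligible pair and assemble an env-agent combo record
agent_1_id, agent_2_id = random.choice(candidate_pairs)
agent_1, agent_2 = AgentProfile.get(agent_1_id), AgentProfile.get(agent_2_id)
combo = {"env_id": env.pk, "agent_ids": [agent_1.pk, agent_2.pk]}
```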
135 changes: 135 additions & 0 deletions llm_generate/generate_specific_envs.py
@@ -0,0 +1,135 @@
"""This file is used to generate specific environments based on existing
datasets. The generation functions below should call agenerate_env_profile
in `sotopia/generation_utils/generate.py` with the appropriate parameters.
Here are the datasets we have so far:
1. Mutual-Friend (https://huggingface.co/datasets/mutual_friends)
"""
import asyncio

import names
import numpy as np
from datasets import DatasetDict, load_dataset

from generate import (
    StrOutputParser,
    generate,
)


async def generate_mutual_friend_envs() -> tuple[str, list[str]]:
"""Generate environments based on the mutual-friend dataset."""
mutual_friend_dataset: DatasetDict = load_dataset("mutual_friends")
all_data = mutual_friend_dataset["train"]
# sample one datum from all data
datum = np.random.choice(all_data)
friends = datum["scenario_kbs"]
num_of_friends_in_total = sum(map(len, friends))
# generate names for the friends
set_of_names = set()
for _ in range(num_of_friends_in_total):
name = names.get_first_name()
while name in set_of_names:
name = names.get_first_name()
set_of_names.add(name)
list_of_names = list(set_of_names)
friend_map: dict[tuple[str, ...], str] = {}
friend_list_map: list[list[str]] = [[] for _ in range(len(friends))]
friend_description_keys: list[str] = datum["scenario_attributes"]["name"]
name_pointer = 0
for i, friends_array in enumerate(friends):
for friend in friends_array:
assert (
len(friend) == 2
) # in [[key1, key2, ...], [value1, value2, ...]] format
if not tuple(friend[1]) in friend_map:
friend_map[tuple(friend[1])] = list_of_names[name_pointer]
name_pointer += 1
friend_list_map[i].append(friend_map[tuple(friend[1])])
friend_set_map: list[set[str]] = [
set(friend_list) for friend_list in friend_list_map
]
common_friends = []
for friend_description, friend_name in friend_map.items():
if all([friend_name in friend_set for friend_set in friend_set_map]):
common_friends.append(friend_name)
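    # assumes the sampled datum has at least one common friend (the Mutual-Friend task is built around a shared friend)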
scenario = (
f'{len(friends)} strangers are meeting at a party. <p viewer="environment">They have {len(common_friends)} common friends: '
f"{', '.join(common_friends[:-1])}"
+ (" and " if len(common_friends) > 1 else "")
+ common_friends[-1]
+ ".</p>"
)
goals: list[str] = []
for friends_array in friends:
template = f"You are trying to figure out whether you have a mutual friend with the other person. \n"
template += f"<extra_info> You know the following friends"
for friend in friends_array:
friend_name = friend_map[tuple(friend[1])]
friend_description = friend[1]
template += f" {friend_name}: {' '.join([(i + ': ' + j + ' ') if i != 'Name' else '' for i, j in zip(friend[0], friend_description)])}\n"
template += f"</extra_info>"
goals.append(template)

return scenario, goals


async def generate_craigslist_bargains_envs() -> tuple[str, list[str]]:
"""Generate environments based on the craigslist_bargains dataset."""
craigslist_bargains_dataset: DatasetDict = load_dataset(
"craigslist_bargains"
)
all_data = craigslist_bargains_dataset["train"]
# sample one datum from all data
datum = np.random.choice(all_data)
scenario = generate(
model_name="gpt-4",
template="The following sentence is automatically generated with the following"
'template: "One person is selling <item> for <price>, another person is'
'trying to buy it. Here is the description of the item: <description>." with item = {title}, '
"price={price}, and description={description} Please make the sentence"
"fluent and natural.",
input_values={
"title": datum["items"]["Title"][0],
"price": datum["items"]["Price"][0],
"description": datum["items"]["Description"][0],
},
output_parser=StrOutputParser(),
)

goals: list[str] = []
for i in range(2):
if datum["agent_info"]["Role"][i] == "seller":
markup_ratio = np.random.exponential(0.5)
datum["agent_info"]["Target"][i] = datum["items"]["Price"][0] / (
1 + markup_ratio
)
goal = generate(
model_name="gpt-4",
template="The following sentence is automatically generated with the following"
'template: "You want to <role> this item. Your target price '
"is $<price> (round up to two decimals). You will get penalty if you sell or buy it "
"for a price that is significantly lower than (if <role> is seller) or significantly"
"higher than (if <role> is buyer) the target price, but will get bonus if you successfully "
"sell it higher than the target price (if <role> is seller) or buy it for lower than"
'the target price (if <role> is buyer)." '
"with role = {role} and price = {price}. Please make the sentence"
"fluent and natural. Do not change the original meaning of the sentence.",
input_values={
"role": datum["agent_info"]["Role"][i],
"price": datum["agent_info"]["Target"][i],
},
output_parser=StrOutputParser(),
)
goals.append(goal)

return scenario, goals


if __name__ == "__main__":
    for _ in range(10):
        scenario, goals = asyncio.run(generate_mutual_friend_envs())
        import pdb; pdb.set_trace()  # drop into the debugger to inspect each generated sample
1 change: 1 addition & 0 deletions llm_generate/requirments.txt
@@ -0,0 +1 @@
sotopia
51 changes: 51 additions & 0 deletions llm_generate/step1_generate_env_profile.py
@@ -0,0 +1,51 @@
import asyncio
import random
from typing import TypeVar
from tqdm import tqdm

import pandas as pd
import rich
from pydantic import BaseModel

from sotopia.database import EnvironmentProfile
from sotopia.generation_utils.generate import agenerate_env_profile

random.seed(41)

env_borrowMoney = EnvironmentProfile.find(
EnvironmentProfile.codename == "borrow_money"
).all()[0]
env_roadtrip = EnvironmentProfile.find(
EnvironmentProfile.codename == "take_turns"
).all()[0]
env_prisonerDillema = EnvironmentProfile.find(
EnvironmentProfile.codename == "prison_dilemma"
).all()[0]

examples = f"{env_borrowMoney.json()}\n\n{env_roadtrip.json()}\n\n{env_prisonerDillema.json()}"

ins_prompts = pd.read_csv("./inspirational_prompt_for_env.csv")
prompts = ins_prompts["prompt"].tolist()

T = TypeVar("T", bound=BaseModel)


def pydantics_to_csv(filename: str, data: list[T]) -> None:
pd.DataFrame([item.dict() for item in data]).to_csv(filename, index=False)


backgrounds = []
for prompt in tqdm(prompts):
rich.print(prompt)
background, prompt_full = asyncio.run(
agenerate_env_profile(
model_name="gpt-4",
inspiration_prompt=prompt,
examples=examples,
)
)
rich.print(background)
rich.print(prompt_full)
backgrounds.append(background)

pydantics_to_csv("./backgrounds.csv", backgrounds)
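Assuming a local redis instance holds the sotopia profiles (redis-om reads the `REDIS_OM_URL` environment variable), a hedged invocation of this script might look like:
```bash
REDIS_OM_URL="redis://localhost:6379" python step1_generate_env_profile.py
```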