Extra input fields are not preserved in generated_predictions.jsonl, and the <image> token is lost #6070

enerai · 2024-11-19T02:44:24Z

Reminder

I have read the README and searched the existing issues.

System Info

...

Reproduction

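Problem description

When running the batch prediction command, how can I carry the extrainfo1 and extrainfo2 fields from the input data over into generated_predictions.jsonl? In addition, I found that the prompt field in the output does not contain the <image> token that is present in the input.

Command used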

torchrun ${DISTRIBUTED_ARGS} src/train.py \
    --stage sft \
    --do_predict \
    --predict_with_generate \
    --use_fast_tokenizer \
    --flash_attn auto \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --eval_dataset ${eval_dataset} \
    --output_dir $OUTPUT_PATH \
    --template qwen2_vl \
    --finetuning_type full \
    --do_sample False \
    --max_new_tokens 4 \
    --repetition_penalty 1 \
    --length_penalty 1 \
    --num_beams 1 \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_eval_batch_size 2 \
    --ddp_timeout 9000 \
    --logging_steps 1 \
    --cutoff_len 4096 \
    --bf16

Input data format

Each line of the input file looks like this:

{
  "messages": [
    {"content": "...", "role": "user"},
    {"content": "...", "role": "assistant"}
  ],
  "images": [],
  "extrainfo1": "...",
  "extrainfo2": "..."
}

Expected output

I would like generated_predictions.jsonl to keep the extrainfo1 and extrainfo2 fields, so that each output record contains:

{
  "prompt": "...",
  "label": "...",
  "predict": "...",
  "extrainfo1": "...",
  "extrainfo2": "..."
}

Current behavior

generated_predictions.jsonl is missing the extrainfo1 and extrainfo2 fields.
The <image> token is missing from the prompt field.
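
A possible post-processing workaround for the missing fields, offered only as a sketch: it is not something the prediction command provides. It assumes that generated_predictions.jsonl holds exactly one line per input record and in the same order as the input file; that assumption should be verified for the run in question, especially with distributed evaluation. The function name `merge_extra_fields` and its parameters are hypothetical.

```python
import json

# Field names taken from the input format shown in this issue.
EXTRA_KEYS = ("extrainfo1", "extrainfo2")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def merge_extra_fields(input_path, pred_path, output_path, extra_keys=EXTRA_KEYS):
    """Copy extra_keys from each input record into the prediction on the same line.

    ASSUMPTION: predictions are aligned one-to-one with inputs by line order.
    """
    inputs = read_jsonl(input_path)
    preds = read_jsonl(pred_path)
    if len(inputs) != len(preds):
        raise ValueError(
            f"line count mismatch: {len(inputs)} inputs vs {len(preds)} predictions"
        )

    with open(output_path, "w", encoding="utf-8") as f:
        for src, pred in zip(inputs, preds):
            merged = dict(pred)  # keep prompt / label / predict untouched
            for key in extra_keys:
                if key in src:
                    merged[key] = src[key]
            f.write(json.dumps(merged, ensure_ascii=False) + "\n")
    return len(preds)
```

The length check is deliberate: if the two files disagree on record count, positional alignment cannot be trusted, so the sketch stops rather than writing mismatched rows. It does not address the missing <image> token in the prompt field.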

Expected behavior

No response

Others

No response

github-actions bot added the "pending" label (This problem is yet to be addressed) on Nov 19, 2024
