A Step-by-Step Guide to Creating a Custom Vision-Language Dataset for Fine-Tuning Qwen-2-VL with LLaMA-Factory

Learn how to create a custom vision-language dataset and fine-tune Qwen-2-VL using LLaMA-Factory for specialized AI tasks like Document Visual Question Answering (DocVQA).

Ashok Poudel
9 min read · Oct 22, 2024

Introduction

Fine-tuning large language models (LLMs) for specialized tasks often requires a well-curated dataset, especially when working with vision-language models like Qwen-2-VL. Qwen-2-VL is a powerful tool for tasks that involve understanding and interpreting both text and images, making it ideal for scenarios like document analysis, visual question answering (VQA), and more. However, creating a custom dataset tailored to the requirements of such models can be challenging.

In this article, I’ll walk you through the entire process of creating a vision-language dataset for fine-tuning Qwen-2-VL using LLaMA-Factory, an open-source library designed for training and fine-tuning models. We’ll cover everything from preparing the data to uploading it to Hugging Face and finally integrating it into a fine-tuning script.

Prerequisites

Before we dive in, make sure you have:

  • Basic knowledge of Python programming.
  • An OpenAI account and a Hugging Face account, along with API tokens for both.
  • A basic understanding of fine-tuning, LLaMA-Factory, and Qwen-2-VL. If you’re new to these tools, you can check out their respective documentation:
  • Qwen-2-VL Fine-tuning Script
  • LLaMA-Factory

Step 1: Setting Up Your Environment

Make sure to install the required libraries before starting:

pip install openai pillow pandas datasets huggingface_hub

Step 2: Creating and Structuring the Dataset

For this example, we are building a dataset for Document VQA, where each image represents a contract document, and the dataset contains question-answer pairs derived from these images.

Preparing Images

  1. Store all your contract images in a folder named images.
  2. Resize the images to ensure they are manageable for the model:
from PIL import Image
import os

def resize_image(image_path, max_size=1024):
    with Image.open(image_path) as img:
        aspect_ratio = img.width / img.height
        new_width = max_size if img.width > img.height else int(max_size * aspect_ratio)
        new_height = max_size if img.height > img.width else int(max_size / aspect_ratio)
        resized_img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
        resized_img.save(image_path)
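
For example, you can resize every image in the images folder in one pass (the full script in the next step does this automatically, so this is just a quick usage sketch):

for filename in os.listdir("images"):
    if filename.lower().endswith((".png", ".jpg", ".jpeg")):
        resize_image(os.path.join("images", filename), max_size=1024)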

Generating Question-Answer Pairs

For each image, we use an LLM to generate question-answer pairs:

  • Example question: “What is the effective date of the contract?”
  • Example answer: “January 1, 2023”

Here’s a script to process images and generate QA pairs:

import os
import csv
import base64
from PIL import Image
from openai import OpenAI

OPENAI_API_KEY = ""

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

def resize_image(image_path, max_size=1024):
    """
    Resize the image while maintaining aspect ratio.
    Overwrites the original image with the resized version.
    """
    with Image.open(image_path) as img:
        # Calculate the aspect ratio
        aspect_ratio = img.width / img.height

        # Determine new dimensions while maintaining aspect ratio
        if img.width > img.height:
            new_width = max_size
            new_height = int(max_size / aspect_ratio)
        else:
            new_height = max_size
            new_width = int(max_size * aspect_ratio)

        # Resize and save the image back to the original path
        resized_img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
        resized_img.save(image_path)
        print(f"Resized and saved: {image_path} to {new_width}x{new_height}")

def generate_question_answer_pairs(image_base64):
    """
    Call the GPT-4o model to generate question-answer pairs based on the given image.
    """
    # Prepend the required prefix for base64 image data
    image_data_url = f"data:image/png;base64,{image_base64}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": """# ROLE: Vision language model dataset generator
# Mission: Analyze the given image to generate a list of question-answer pairs based on the text and information present.
The questions should focus on key information present or not present in the document.
For each question, if the answer is present in the document, provide the exact answer.
If the information is not present or cannot be identified from the document, use the answer "Not Present."
Avoid using specific names of individuals when possible.
The goal is to create both positive and negative answer sets to train the model to understand and distinguish between available and unavailable information.

# Steps
1. Examine the document for key pieces of information.
2. Identify the following elements where applicable.

For example:
- Organization names.
- Titles and roles.
- Dates (effective date, expiration date, etc.).
- Signatures.
- Specific contract terms, phrases, or numbers.

3. Formulate questions based on these identified elements.
4. Determine the answer for each question, whether it is directly available or "Not Present" if absent.

# Output Format

- CSV format with two columns: "Question" and "Answer".
- Each row should represent one question-answer pair.
- Format each entry as follows: "Question","Answer"
- Ensure the output is structured with each question-answer pair on a new line.
- Enclose within ```csv and ``` for post processing.
"""
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_data_url
                        }
                    }
                ]
            }
        ],
        temperature=1,
        max_tokens=2048,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        response_format={"type": "text"}
    )

    # Extract and clean the response text
    response_text = response.choices[0].message.content
    # Remove the ```csv delimiters and stray triple quotes
    clean_csv = response_text.replace("```csv", "").replace("```", "").strip()
    clean_csv = clean_csv.replace('"""', '"')  # Collapse triple double quotes into a single double quote

    # Filter out unwanted header lines like "Question","Answer"
    cleaned_lines = [
        line for line in clean_csv.splitlines()
        if not line.lower().strip().startswith(("question", "answer"))  # Removes any header lines
    ]

    return "\n".join(cleaned_lines)

def get_processed_images(output_csv_path):
    """
    Reads the output CSV file and returns a set of image names that have already been processed.
    """
    processed_images = set()
    if os.path.exists(output_csv_path):
        with open(output_csv_path, mode='r') as csvfile:
            csv_reader = csv.reader(csvfile)
            next(csv_reader, None)  # Skip the header (if the file is not empty)
            for row in csv_reader:
                processed_images.add(row[0])  # Image name is in the first column
    return processed_images

def process_images_in_folder(folder_path, output_csv_path):
    """
    Scan a folder for images, resize and process each one with the GPT-4o model, and save the results to a CSV file.
    Skips images that have already been processed.
    """
    # Get the set of processed images
    processed_images = get_processed_images(output_csv_path)

    # Open or create the CSV file for appending data
    with open(output_csv_path, mode='a', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)

        # Write the header if the file is empty
        if os.stat(output_csv_path).st_size == 0:
            csv_writer.writerow(['Image Name', 'Question', 'Answer'])

        # Iterate through each image in the specified folder
        for image_filename in os.listdir(folder_path):
            if image_filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                # Skip the image if it has already been processed
                if image_filename in processed_images:
                    print(f"Skipping already processed image: {image_filename}")
                    continue

                image_path = os.path.join(folder_path, image_filename)

                # Resize the image to a manageable size
                resize_image(image_path, max_size=1024)

                # Convert the resized image to base64
                with open(image_path, "rb") as image_file:
                    image_base64 = base64.b64encode(image_file.read()).decode('utf-8')

                # Generate question-answer pairs using the model
                qa_pairs = generate_question_answer_pairs(image_base64)

                # Split the clean CSV data into rows and write each row with the image name
                for row in qa_pairs.splitlines():
                    if ',' not in row:
                        continue  # Skip blank or malformed lines
                    question, answer = row.split(',', 1)
                    csv_writer.writerow([image_filename, question.strip().strip('"'), answer.strip().strip('"')])

# Define folder path and output CSV path
input_folder_path = "images"
output_csv_path = "output.csv"

# Process the images and generate the CSV
process_images_in_folder(input_folder_path, output_csv_path)

This script processes a folder of images, resizes them, generates question-answer pairs from the content of each image using GPT-4o, and saves the results to a CSV file. It keeps track of which images have already been processed so none is handled twice, and it resizes each image to a manageable size before sending it to the model. The image-to-text step encodes each image in base64, sends the encoded data to GPT-4o, and parses the response to extract the question-answer pairs. Here is a breakdown of its functionality:

  • Dependencies: The code imports several libraries including os for file and directory handling, csv for reading and writing CSV files, base64 for encoding images, PIL for image processing, and OpenAI for interacting with the OpenAI API.
  • API Initialization: It initializes the OpenAI client using an API key to enable interaction with the GPT-4o model.
  • resize_image Function: This function takes an image path and resizes the image to a maximum dimension (1024 pixels), maintaining its aspect ratio, and overwrites the original image.
  • generate_question_answer_pairs Function: This function sends a base64-encoded image to the GPT-4o model. The model is instructed to analyze the image and produce question-answer pairs, focusing on key information present in the image or marking answers as "Not Present" when the information is absent. The output is structured as CSV data.
  • get_processed_images Function: Reads the output CSV to gather a set of images that have already been processed, preventing redundant processing.
  • process_images_in_folder Function: Iterates through images in a specified folder, resizes and encodes them, generates question-answer pairs using the GPT-4o model, and appends the results to a CSV file. It skips already processed images, ensuring efficiency.
  • File Paths: The folder containing images and the output CSV file path are defined as input_folder_path and output_csv_path.
  • Execution: The code processes all images in the specified folder, generates the question-answer pairs, and saves them to a CSV file.

This code is suitable for creating a dataset of question-answer pairs based on image content, particularly for tasks involving information extraction from visual documents like contracts, IDs, and more.
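
Before moving to the upload step, it helps to sanity-check the generated CSV. Here is a minimal sketch using pandas, assuming the 'Image Name', 'Question', 'Answer' columns written by the script above:

import pandas as pd

df = pd.read_csv("output.csv")
print(df.head())  # preview a few question-answer pairs
print(f"{df['Image Name'].nunique()} images, {len(df)} QA pairs")
# Rough share of negative answers (the exact "Not Present" string may vary slightly)
print(f"Negative answers: {df['Answer'].str.contains('Not Present', na=False).mean():.0%}")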

Step 3: Uploading the Dataset to Hugging Face Hub

Now that we have the dataset in a CSV format, let’s prepare it for uploading to the Hugging Face Hub.

import os
import pandas as pd
from datasets import Dataset, Features, Image, Value
from huggingface_hub import HfApi

Read the CSV and group the data:

# Read the CSV file
csv_file_path = 'output.csv' # Replace with the path to your uploaded CSV file
images_folder_path = 'images' # Replace with the path to your images folder
df = pd.read_csv(csv_file_path)

grouped_data = df.groupby('Image Name')

Prepare Data for Hugging Face Dataset:

data_list = []
for image_name, group in grouped_data:
    messages = []
    # Load the image path (no need to open the image in PIL here)
    image_path = os.path.normpath(os.path.join(images_folder_path, image_name))

    for idx, row in group.iterrows():
        # Add a user message with an <image> tag appended to EVERY question
        user_message = row['Question'] + "<image>"
        messages.append({"role": "user", "content": user_message})
        messages.append({"role": "assistant", "content": row['Answer']})

    entry = {
        "messages": messages,
        "images": [image_path] * len(group)  # Store the image path once per question
    }
    data_list.append(entry)

# Define dataset features
features = Features({
    'messages': [{'role': Value('string'), 'content': Value('string')}],
    'images': [Image()]  # Specify the 'images' feature as a list of Image type
})

# Convert to a Hugging Face Dataset
dataset = Dataset.from_list(data_list, features=features)

Inserting the <image> tags is an important step. Without them you will hit the error “ValueError: The number of images does not match the number of <image> tokens”, which means that for some example the number of entries in its images list does not equal the number of <image> tokens in its messages. LLaMA-Factory expects exactly one <image> token for every image attached to an example.
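
A quick way to catch this before uploading is to compare the two counts yourself. A minimal check over the data_list built above:

for entry in data_list:
    n_tags = sum(m["content"].count("<image>") for m in entry["messages"])
    if n_tags != len(entry["images"]):
        print(f"Mismatch: {n_tags} <image> tags vs {len(entry['images'])} images")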

Upload to Hugging Face Hub:

# Define Hugging Face repository details
dataset_repo_id = "hfusername/DOCVQA-dataset" # Replace with your Hugging Face username and dataset name
hf_token = "" # Replace with your Hugging Face token

# Push the dataset to Hugging Face Hub
api = HfApi()
api.create_repo(repo_id=dataset_repo_id, token=hf_token, repo_type="dataset", exist_ok=True, private=True)
dataset.push_to_hub(dataset_repo_id, token=hf_token)
print(f"Dataset has been uploaded to {dataset_repo_id} on the Hugging Face Hub.")

# Optional: Print a sample to verify the structure
print("\nSample entry from the dataset:")
print(dataset[1])
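
To confirm the upload worked, you can load the dataset back from the Hub. A quick sketch (on older versions of the datasets library the token argument is called use_auth_token):

from datasets import load_dataset

# Reuse the repo id and token defined above
ds = load_dataset(dataset_repo_id, split="train", token=hf_token)
print(ds)
print(ds[0]["messages"])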

(Screenshot: the VL dataset viewed on the Hugging Face Hub)

Step 4: Fine-Tuning Qwen-2-VL with LLaMA-Factory

Now that the dataset is ready and uploaded, let’s configure the fine-tuning process using LLaMA-Factory.

Colab notebook: link

Update dataset_info.json: In LLaMA-Factory/data/dataset_info.json, add:

"mycustomdataset": {
"hf_hub_url": "your_username/DOCVQA-dataset",
"formatting": "sharegpt",
"columns": {
"messages": "messages",
"images": "images"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
}

Fine-Tuning Script: Update the dataset argument in the args dict of the fine-tuning script for Qwen-2-VL:

args = dict(
    stage="sft",
    do_train=True,
    model_name_or_path="Qwen/Qwen2-VL-2B-Instruct",
    dataset="mycustomdataset",
    template="qwen2_vl",
    finetuning_type="lora",
    lora_target="all",
    output_dir="qwen2vl_lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    logging_steps=10,
    warmup_ratio=0.1,
    save_steps=1000,
    learning_rate=5e-5,
    num_train_epochs=3.0,
    max_samples=500,
    max_grad_norm=1.0,
    loraplus_lr_ratio=16.0,
    fp16=True,
    use_liger_kernel=True,
)

Run the Script: Execute the script to start the fine-tuning process. Adjust the hyperparameters as needed for optimal performance.
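
One common way to launch the run, and the approach the LLaMA-Factory Colab examples typically use, is to dump the args to a config file and pass it to the CLI. A sketch (the filename is just a placeholder):

import json

# Write the training arguments to a config file
with open("train_qwen2vl.json", "w", encoding="utf-8") as f:
    json.dump(args, f, indent=2)

# Then, from the LLaMA-Factory repository root, launch training:
#   llamafactory-cli train train_qwen2vl.json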

Conclusion

By following this guide, you now have a custom vision-language dataset and a setup to fine-tune the Qwen-2-VL model using LLaMA-Factory. This process is adaptable to various vision-language tasks beyond document VQA, making it a versatile approach for building specialized models.

Happy fine-tuning, and may your models reach new heights of performance!

#VisionLanguageModels #Qwen2VL #MachineLearning #AITraining #DatasetCreation #FineTuning #LLMs #LLaMAFactory #DOCVQA #AIResearch #DataScience #NLP #ComputerVision #AIModelTraining #CustomDataset

Personal Note

I wrote this article because, when I needed to create a vision-language dataset for fine-tuning Qwen-2-VL, I found a lack of resources on how to build a custom dataset for this specific task. Most guides covered the fine-tuning process but skipped over the critical step of dataset creation. Through trial and error, I learned how to adapt tools like the LLaMA-Factory script and wanted to share this knowledge to help others facing similar challenges.

Don’t forget to follow me on Medium and Twitter (@ashokpoudel) for more updates on similar articles. I write and talk about Web Development | Senior Technical Manager | Generative AI Enthusiast. Don’t hesitate to reach out: https://www.linkedin.com/in/ashokpoudel/
