Fine-Tuning Qwen 2.5 7B for Python Code Generation using LoRA

Parameter-Efficient Fine-Tuning with Low-Rank Adaptation

Mahanth Yalla

M.Tech Artificial Intelligence

Indian Institute of Science, Bengaluru

Abstract

This project details the fine-tuning of the Qwen 2.5 7B large language model for Python code generation tasks. Leveraging Parameter-Efficient Fine-Tuning (PEFT) techniques, specifically LoRA (Low-Rank Adaptation), we explore the impact of varying LoRA hyperparameters—rank (r) and alpha (α)—on model performance. Our experiments, conducted on the flytech/python-codes-25k dataset, evaluate model efficacy using metrics such as evaluation loss, BLEU score, similarity score, and execution success rate. Results indicate that strategic selection of LoRA hyperparameters can lead to significant improvements in code generation quality, with certain configurations achieving up to 93.34% execution success rate and favorable BLEU scores, demonstrating the potential for creating capable code-generating LLMs even with limited computational resources.

Motivation & Problem Statement

The Challenge

The advent of large language models has revolutionized code generation, promising dramatic gains in developer productivity. However, fine-tuning a large pre-trained LLM for a specific task typically demands substantial computational resources, well beyond what most individual practitioners can access.

Our Solution: LoRA

Parameter-Efficient Fine-Tuning (PEFT) methods, specifically LoRA (Low-Rank Adaptation), offer a viable alternative: the pre-trained weights are frozen and only small low-rank adapter matrices are trained, cutting memory and compute requirements by orders of magnitude while preserving the base model's general capabilities.

Key Contributions

Systematic Hyperparameter Search

Comprehensive exploration of LoRA rank (r) and alpha (α) parameters to identify optimal configurations for Python code generation.

Multi-Metric Evaluation

Holistic assessment using evaluation loss, BLEU score, similarity metrics, and critically, execution success rate for functional correctness.

Extended Training Analysis

Deep investigation of promising configurations across multiple epochs (up to 3175 steps) to understand learning dynamics.

Practical Code Generation

Demonstrated capability to generate syntactically correct and functionally accurate Python code for common programming tasks.

Methodology

Model Architecture

Base Model

Qwen 2.5 7B - Selected for its strong general-purpose capabilities and open-source availability, providing a solid foundation for specialized code generation.

Fine-Tuning Method

LoRA - Low-Rank Adaptation freezes the base model and injects trainable low-rank matrices into the attention layers, reducing the trainable parameter count from 7B to roughly 10-50M weights.

LoRA Hyperparameters

r

Rank (r)

Dimensionality of low-rank matrices. Higher rank implies more parameters and potentially greater expressiveness, but increased computational cost. Tested values: {1, 2, 4, 8, 16, 32}

α

Alpha (α)

Scaling factor for LoRA updates, determining magnitude of adaptation. Effective learning rate proportional to α/r. Tested values: {1, 2, 4, 8, 16, 32, 64}
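
To make the interaction of r and α concrete, the following is a minimal numerical sketch of the LoRA update; the 4096×4096 projection shape is an illustrative assumption, not the model's actual dimensions.

import numpy as np

d, k = 4096, 4096                  # assumed projection shape, for illustration
r, alpha = 8, 16                   # rank and scaling factor (α/r = 2)

A = np.random.randn(r, k) * 0.01   # down-projection matrix, trainable
B = np.zeros((d, r))               # up-projection matrix, zero-initialized as in LoRA
delta_W = (alpha / r) * (B @ A)    # scaled low-rank update added to the frozen W
                                   # (zero at initialization, so training starts at W)

print(delta_W.shape)               # (4096, 4096), parameterized by only (d + k) * r weights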

Dataset

flytech/python-codes-25k: A curated dataset of Python code snippets enabling the model to learn syntax, structure, and common programming patterns specific to Python.
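
As a minimal sketch, the dataset can be loaded with the Hugging Face datasets library; the exact column layout is an assumption to verify against the dataset card before use.

from datasets import load_dataset

# Load the ~25k-example Python instruction dataset.
dataset = load_dataset("flytech/python-codes-25k", split="train")

# Inspect the schema rather than assuming field names;
# instruction/output-style columns are expected but not confirmed here.
print(dataset.column_names)
print(dataset[0])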

Evaluation Metrics

Metric                 | Description                                           | Purpose
-----------------------|-------------------------------------------------------|-----------------------------------
Evaluation Loss        | Model prediction quality on unseen data               | Generalization capability
BLEU Score             | Bilingual Evaluation Understudy                       | Syntactic similarity to reference
Similarity Score       | Semantic similarity using embeddings                  | Conceptual alignment
Execution Success Rate | Percentage of generated code that runs without error  | Functional correctness (critical)
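
Since execution success rate drives the conclusions below, here is a plausible minimal harness for measuring it; the project's actual harness is not shown, so treat this as a sketch in which each generated snippet is run as a standalone script and hangs count as failures.

import os
import subprocess
import sys
import tempfile

def execution_success_rate(snippets, timeout=5):
    """Percentage of generated snippets that execute without error."""
    successes = 0
    for code in snippets:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            successes += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # a hang counts as a failure
        finally:
            os.unlink(path)
    return 100.0 * successes / len(snippets)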

Experimental Results

Initial Hyperparameter Search (1 Epoch, 635 Steps)

Systematic exploration of 30+ LoRA configurations revealed important trade-offs between evaluation metrics and functional correctness:

Rank (r) | Alpha (α) | α/r Ratio | Eval Loss | Avg BLEU | Avg Similarity
---------|-----------|-----------|-----------|----------|---------------
1        | 2         | 2         | 0.5500    | 29.82    | 36.12
2        | 4         | 2         | 0.5374    | 28.96    | 36.03
4        | 8         | 2         | 0.5209    | 28.38    | 34.86
8        | 16        | 2         | 0.5035    | 28.12    | 34.85
16       | 32        | 2         | 0.4883    | 32.62    | 38.04
32       | 32        | 1         | 0.4798    | 32.88    | 40.15
32       | 64        | 2         | 0.4740    | 31.36    | 36.84

Key Findings from Initial Search

Best Loss: r=32, α=32

  • Evaluation Loss: 0.4798
  • Avg BLEU: 32.88
  • Avg Similarity: 40.15
  • Execution Rate: 73.34%

Best Execution: r=1, α=2

  • Evaluation Loss: 0.5500
  • Avg BLEU: 29.82
  • Avg Similarity: 36.12
  • Execution Rate: 93.34%

Balanced: r=16, α=32

  • Evaluation Loss: 0.4883
  • Avg BLEU: 32.62
  • Avg Similarity: 38.04
  • Execution Rate: ~80%

Extended Training Results (Up to 3175 Steps)

Selected configurations were trained for multiple epochs to analyze learning dynamics and identify optimal stopping points:

Rank | Alpha | Step | Eval Loss | Train Loss | Execution Rate
-----|-------|------|-----------|------------|---------------
1    | 2     | 2500 | 0.4728    | 0.3873     | 86.67%
8    | 16    | 1270 | 0.3946    | 0.3406     | 90.00%
4    | 16    | 1905 | 0.4446    | 0.4212     | 83.33%
16   | 32    | 1270 | 0.3527    | 0.2813     | 86.67%
32   | 64    | 635  | 0.4404    | 0.4250     | 86.67%
32   | 64    | 2500 | 0.1895    | 0.1018     | 56.67%

Critical Observations

Trade-off Discovered

Lower evaluation loss doesn't guarantee higher execution success. Config r=32, α=64 achieved lowest eval loss (0.1895) but only 56.67% execution rate.

Optimal Configuration

r=8, α=16 at step 1270 achieved the highest execution success (90.00%) with good eval loss (0.3946), suggesting a sweet spot for practical code generation.

Early Stopping Benefits

Many configurations reached peak execution rates early in training (steps 635-1270), suggesting excessive training may hurt functional correctness despite improving loss metrics.

Rank Impact

Smaller ranks (r=1, 2, 4) maintained high execution rates but had higher eval loss. Larger ranks (r=16, 32) achieved lower loss but execution rates varied significantly.

Generated Code Samples

The following samples demonstrate the model's capability to generate syntactically correct and functionally accurate Python code:

Sample 1: Factorial Function

Prompt: "Write a Python function to calculate factorial"
def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return (n * factorial(n-1))

Analysis: Correctly implements recursive factorial with proper base cases for 0 and 1.

Sample 2: Reverse String Function

Prompt: "Create a function to reverse a string"
def reverse_string(string):
    reversed_str = ""
    for char in string:
        reversed_str = char + reversed_str
    return reversed_str

Analysis: Iterative approach building reversed string character by character, demonstrating clear string manipulation logic.

Sample 3: Prime Number Check

Prompt: "Write a function to check if a number is prime"
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, n):
        if n % i == 0:
            return False
    return True

Analysis: Standard primality test algorithm with correct edge case handling (n ≤ 1).

Sample 4: Binary Search

Prompt: "Implement binary search in Python"
def binary_search(array, target):
    left = 0
    right = len(array) - 1
    
    while left <= right:
        middle = (left + right) // 2
        if array[middle] == target:
            return middle
        elif array[middle] < target:
            left = middle + 1
        else:
            right = middle - 1
    
    return False

Analysis: Correctly implements iterative binary search with proper pointer management and termination conditions. One caveat: returning False for a missing target is unconventional and ambiguous, since False == 0 collides with a valid index; the usual sentinel is -1.

Sample 5: Longest Common Substring (Advanced)

Prompt: "Write a function to find the longest common substring"
def longest_common_substring(str1, str2):
    # Create 2D matrix for dynamic programming
    matrix = [[0]*(len(str2) + 1) for _ in range(len(str1) + 1)]
    max_length = 0
    end_idx_str1 = 0
    
    # Fill matrix using dynamic programming
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i-1] == str2[j-1]:
                matrix[i][j] = matrix[i-1][j-1] + 1
                if matrix[i][j] > max_length:
                    max_length = matrix[i][j]
                    end_idx_str1 = i
            else:
                matrix[i][j] = 0
    
    # Extract longest common substring
    if max_length == 0:
        return ""
    else:
        start_idx_str1 = end_idx_str1 - max_length
        return str1[start_idx_str1:end_idx_str1]

Analysis: Sophisticated implementation using dynamic programming with proper matrix initialization, filling logic, and substring extraction. Demonstrates the model's capability with complex algorithms.

Discussion & Analysis

Key Insights

Metric Divergence

Evaluation loss and BLEU scores don't fully capture code correctness. Minor syntactic errors can cause execution failure despite good similarity scores, highlighting the need for execution-based evaluation.

Hyperparameter Trade-offs

Higher rank configurations (r=16, 32) learn patterns faster (lower loss) but may overfit to dataset syntax rather than generalizable code structure, reducing execution success.

Practical Recommendations

For production code generation, a moderate rank (r=4-8) with proportional alpha (α=2r) provides the best balance of learning capability and functional correctness.

Training Duration

Peak execution rates were often achieved early in training (steps 635-1270). Extended training improves loss metrics but may degrade practical utility, suggesting early stopping based on execution rate, as sketched below.
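
A hypothetical sketch of that selection step, reusing the execution_success_rate helper from the Evaluation Metrics section; generate_fn stands in for whatever generation wrapper is used and is not part of the project's reported code.

def best_checkpoint(checkpoints, prompts, generate_fn):
    """Return the checkpoint whose generations execute most reliably.

    generate_fn(checkpoint, prompt) -> generated code string (hypothetical).
    """
    best_rate, best = -1.0, None
    for ckpt in checkpoints:
        snippets = [generate_fn(ckpt, p) for p in prompts]
        rate = execution_success_rate(snippets)
        if rate > best_rate:
            best_rate, best = rate, ckpt
    return best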

Detailed Performance Analysis

Configuration r=1, α=2: Despite the highest eval loss (0.5500), it achieved an exceptional 93.34% execution rate in the initial sweep and maintained 86.67% under extended training. This suggests that minimal parameter adaptation is sufficient for functional code generation.

Configuration r=8, α=16: Emerged as the optimal balance: 90.00% execution rate with a reasonable eval loss (0.3946) at step 1270. This configuration provides sufficient expressiveness without overfitting.

Configuration r=32, α=64: Achieved lowest eval loss (0.1895) but execution rate dropped to 56.67%, indicating potential overfitting to training data patterns rather than learning generalizable code structures.

Implications for Code Generation

Taken together, these results imply that code-generation fine-tuning should be validated by executing the model's output rather than by loss or n-gram overlap alone: checkpoints chosen by evaluation loss can produce markedly less runnable code than checkpoints chosen by execution rate.

Implementation Details

Training Configuration

Model: Qwen 2.5 7B
Dataset: flytech/python-codes-25k
Training Steps: 635 (1 epoch) to 3175 (5 epochs)
Batch Size: Adaptive based on GPU memory
Learning Rate: Adaptive with LoRA α/r scaling
Optimizer: AdamW
Target Modules: Query, Key, Value projections in attention layers
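
A condensed sketch of how this configuration can be wired up with the Hugging Face transformers and peft libraries; the batch size and learning rate below are placeholders (the project chose these adaptively), and the dataset's text field is an assumption.

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Freeze the base model and attach LoRA adapters to the attention projections.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
))

dataset = load_dataset("flytech/python-codes-25k", split="train")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen25-lora-python",
        per_device_train_batch_size=4,  # placeholder; sized to GPU memory in the project
        learning_rate=2e-4,             # placeholder value
        max_steps=635,                  # roughly one epoch in the reported runs
        optim="adamw_torch",
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()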

LoRA Architecture

For an attention weight matrix W ∈ R^(d×k):
  W' = W + (α/r) · BA
where:
  B ∈ R^(d×r) - Up-projection matrix (trainable, initialized to zero)
  A ∈ R^(r×k) - Down-projection matrix (trainable)
  r << min(d,k) - Rank (hyperparameter)
  α/r - Scaling applied to the low-rank update

Effective parameter reduction:
  Original: d × k parameters
  LoRA: (d + k) × r parameters
  Ratio: ≈0.4% for r=8, d=4096, k=4096
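
A quick arithmetic check of that ratio, under the same assumed d = k = 4096 and r = 8:

d = k = 4096
r = 8
original = d * k                        # 16,777,216 weights in one projection matrix
lora = (d + k) * r                      # 65,536 trainable LoRA weights
print(f"{100 * lora / original:.2f}%")  # 0.39%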

Future Work & Extensions

QLoRA Implementation (In Progress)

Incorporate 4-bit/8-bit quantization using the Unsloth library to further reduce the memory footprint and enable fine-tuning on consumer GPUs with limited VRAM.

Multi-Model Testing (Planned)

Extend methodology to Llama 3.2, Mistral, and CodeLlama to assess generalizability of findings across different model architectures.

Advanced Evaluation (Planned)

Implement Pass@k metric, CodeBERTScore, and comprehensive test case execution across diverse problem sets beyond simple function generation.
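
For reference, such an evaluation would likely use the standard unbiased pass@k estimator of Chen et al. (2021), sketched here; n is the number of samples drawn per problem and c the number that pass the tests.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)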

Dataset Expansion (Planned)

Utilize larger datasets (nvidia/OpenCodeInstruct, jtatman/python-code-dataset-500k) for improved coverage of programming paradigms and edge cases.

LoRA+ Investigation (In Progress)

Explore LoRA+ variant with differential learning rates for A and B matrices to potentially improve adaptation efficiency.

RLHF Integration (Research)

Investigate Reinforcement Learning from Human Feedback to align generated code with developer preferences and best practices.

Research Directions

  • Automated Hyperparameter Selection: Develop algorithms to automatically select optimal LoRA rank based on task complexity and dataset characteristics
  • Multi-Task Fine-Tuning: Explore joint training on multiple programming languages or coding tasks with shared LoRA parameters
  • Iterative Refinement: Implement self-correction mechanisms where model evaluates and refines its own generated code
  • Domain-Specific Adaptation: Fine-tune for specialized domains (data science, web development, algorithm implementation)
  • Error Analysis: Systematic categorization of generation failures to guide targeted improvements

Resources & Links

  • LoRA Paper: Hu et al. "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv:2106.09685)
  • Qwen 2.5: Official Documentation
  • PEFT Library: Hugging Face Parameter-Efficient Fine-Tuning
  • Code Generation Datasets: Awesome LLM Datasets

Conclusion

This project successfully demonstrates that Parameter-Efficient Fine-Tuning using LoRA can create highly capable code-generating LLMs with minimal computational resources. Through systematic exploration of hyperparameters, we identified configurations achieving up to 93.34% execution success rate on Python code generation tasks.

The key finding is that functional correctness (execution rate) and evaluation loss can diverge significantly in code generation tasks. Configuration r=8, α=16 emerged as optimal, achieving 90% execution success with good generalization at a relatively early point in training (step 1270).

Our results highlight that smaller LoRA ranks (r=4-8) with proportional alpha scaling often provide the best balance for practical code generation, suggesting that extensive parameter adaptation may not be necessary—and potentially harmful—for achieving functional correctness. These insights provide valuable guidance for developing efficient, resource-conscious code-generating AI systems.

Impact & Applications

  • Developer Productivity: Automated code generation for routine programming tasks
  • Education: Learning aid for programming students with instant code examples
  • Code Completion: Enhanced IDE integration for context-aware suggestions
  • Rapid Prototyping: Accelerated development cycles with AI-assisted coding
  • Accessible AI: Demonstrating that effective LLM fine-tuning is achievable with limited resources