Abstract
Motivation & Problem Statement
The advent of large language models has revolutionized code generation, promising to enhance developer productivity dramatically. However, fine-tuning large pre-trained LLMs for specific tasks often demands substantial computational resources.
The Challenge
- Resource Constraints: Full fine-tuning of 7B+ parameter models requires significant GPU memory and compute time
- Task Specialization: General-purpose LLMs need adaptation for domain-specific code generation
- Quality vs. Efficiency: Balancing model performance with computational efficiency
- Evaluation Complexity: Code correctness requires more than lexical similarity metrics
Our Solution: LoRA
Parameter-Efficient Fine-Tuning (PEFT) methods, specifically LoRA (Low-Rank Adaptation), offer a viable solution by:
- Injecting trainable rank-decomposition matrices into the transformer architecture (a minimal configuration sketch follows this list)
- Significantly reducing trainable parameters (typically 0.1-1% of the original model's weights)
- Maintaining or improving performance compared to full fine-tuning
- Enabling fine-tuning on consumer-grade hardware
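As a concrete illustration of how such adapters are attached in practice, here is a minimal sketch using the Hugging Face PEFT library; the checkpoint name and target-module list are illustrative assumptions, not the project's exact training script.

```python
# Minimal sketch: attaching LoRA adapters with Hugging Face PEFT (illustrative).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # assumed checkpoint name

lora_config = LoraConfig(
    r=8,                                             # rank of the low-rank update
    lora_alpha=16,                                   # update is scaled by alpha/r
    target_modules=["q_proj", "k_proj", "v_proj"],   # attention projections (assumed names)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the 7B base
```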
Key Contributions
Systematic Hyperparameter Search
Comprehensive exploration of LoRA rank (r) and alpha (α) parameters to identify optimal configurations for Python code generation.
Multi-Metric Evaluation
Holistic assessment using evaluation loss, BLEU score, similarity metrics, and critically, execution success rate for functional correctness.
Extended Training Analysis
Deep investigation of promising configurations across multiple epochs (up to 3175 steps) to understand learning dynamics.
Practical Code Generation
Demonstrated capability to generate syntactically correct and functionally accurate Python code for common programming tasks.
Methodology
Model Architecture
Base Model
Qwen 2.5 7B - Selected for strong general-purpose capabilities and open-source availability, providing a solid foundation for specialized code generation.
Fine-Tuning Method
LoRA - Low-Rank Adaptation injects trainable low-rank matrices into the attention layers, reducing the trainable parameter count from 7B to roughly 10-50M weights.
LoRA Hyperparameters
Rank (r)
Dimensionality of the low-rank matrices. A higher rank means more trainable parameters and potentially greater expressiveness, at increased computational cost. Tested values: {1, 2, 4, 8, 16, 32}
Alpha (α)
Scaling factor for the LoRA update; the update is scaled by α/r, so the effective learning rate of the adapter is proportional to this ratio. Tested values: {1, 2, 4, 8, 16, 32, 64}
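To make the interplay of r and α concrete, the toy layer below (plain PyTorch, illustrative only, not the project's implementation) applies the standard α/r scaling to the low-rank update, so doubling α at a fixed r doubles the magnitude of the adaptation.

```python
# Toy LoRA-style linear layer illustrating the alpha/r scaling (illustrative only).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight W (random here for the sketch)
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # up-projection
        self.scaling = alpha / r                                 # scale of the update

    def forward(self, x):
        # y = x W^T + (alpha/r) * (x A^T) B^T
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(d_in=4096, d_out=4096, r=8, alpha=16)
print(layer.scaling)  # 2.0, i.e. the alpha/r = 2 ratio used throughout the sweep
```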
Dataset
flytech/python-codes-25k: A curated dataset of Python code snippets enabling the model to learn syntax, structure, and common programming patterns specific to Python.
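For reference, loading this corpus with the Hugging Face datasets library looks roughly like the sketch below; the split and column names are not specified here and should be verified against the dataset card.

```python
# Sketch: loading the training corpus (split and column names are assumptions).
from datasets import load_dataset

ds = load_dataset("flytech/python-codes-25k", split="train")
print(ds)               # inspect size and available columns
print(ds[0].keys())     # check the field layout before building prompts
```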
Evaluation Metrics
| Metric | Description | Purpose |
|---|---|---|
| Evaluation Loss | Model prediction quality on unseen data | Generalization capability |
| BLEU Score | N-gram overlap with the reference (Bilingual Evaluation Understudy) | Syntactic similarity to reference |
| Similarity Score | Semantic similarity using embeddings | Conceptual alignment |
| Execution Success Rate | Percentage of executable code | Functional correctness (CRITICAL) |
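Execution success rate can be approximated by running each generated snippet in a separate subprocess and counting clean exits. The sketch below is a simplified, unsandboxed illustration of that idea, not the project's exact evaluation code; a real harness should add sandboxing, resource limits, and temp-file cleanup.

```python
# Simplified sketch of an execution-success-rate check (no sandboxing).
import subprocess
import sys
import tempfile

def execution_success_rate(snippets, timeout=5):
    successes = 0
    for code in snippets:
        # Write each candidate to a temporary file and run it in a fresh interpreter.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            successes += (result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # count hangs as failures
    return 100.0 * successes / len(snippets)

print(execution_success_rate(["print('ok')", "def broken(:\n    pass"]))  # 50.0
```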
Experimental Results
Initial Hyperparameter Search (1 Epoch, 625 Steps)
Systematic exploration of 30+ LoRA configurations revealed important trade-offs between evaluation metrics and functional correctness:
| Rank (r) | Alpha (α) | α/r Ratio | Eval Loss | Avg BLEU | Avg Similarity |
|---|---|---|---|---|---|
| 1 | 2 | 2 | 0.5500 | 29.82 | 36.12 |
| 2 | 4 | 2 | 0.5374 | 28.96 | 36.03 |
| 4 | 8 | 2 | 0.5209 | 28.38 | 34.86 |
| 8 | 16 | 2 | 0.5035 | 28.12 | 34.85 |
| 16 | 32 | 2 | 0.4883 | 32.62 | 38.04 |
| 32 | 32 | 1 | 0.4798 | 32.88 | 40.15 |
| 32 | 64 | 2 | 0.4740 | 31.36 | 36.84 |
Key Findings from Initial Search
- Best Loss: r=32, α=32
- Best Execution: r=1, α=2
- Balanced: r=16, α=32
Extended Training Results (Up to 3175 Steps)
Selected configurations were trained for multiple epochs to analyze learning dynamics and identify optimal stopping points:
| Rank | Alpha | Step | Eval Loss | Train Loss | Execution Rate |
|---|---|---|---|---|---|
| 1 | 2 | 2500 | 0.4728 | 0.3873 | 86.67% |
| 8 | 16 | 1270 | 0.3946 | 0.3406 | 90.00% |
| 4 | 16 | 1905 | 0.4446 | 0.4212 | 83.33% |
| 16 | 32 | 1270 | 0.3527 | 0.2813 | 86.67% |
| 32 | 64 | 635 | 0.4404 | 0.4250 | 86.67% |
| 32 | 64 | 2500 | 0.1895 | 0.1018 | 56.67% |
Critical Observations
Trade-off Discovered
Lower evaluation loss doesn't guarantee higher execution success. Config r=32, α=64 achieved lowest eval loss (0.1895) but only 56.67% execution rate.
Optimal Configuration
r=8, α=16 at step 1270 achieved highest execution success (90.00%) with good eval loss (0.3946), suggesting sweet spot for practical code generation.
Early Stopping Benefits
Many configurations reached peak execution rates early in training (steps 635-1270), suggesting excessive training may hurt functional correctness despite improving loss metrics.
Rank Impact
Smaller ranks (r=1, 2, 4) maintained high execution rates but had higher eval loss. Larger ranks (r=16, 32) achieved lower loss but execution rates varied significantly.
Generated Code Samples
The following samples demonstrate the model's capability to generate syntactically correct and functionally accurate Python code:
Sample 1: Factorial Function
```python
def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return (n * factorial(n-1))
```
Analysis: Correctly implements recursive factorial with proper base cases for 0 and 1.
Sample 2: Reverse String Function
```python
def reverse_string(string):
    reversed_str = ""
    for char in string:
        reversed_str = char + reversed_str
    return reversed_str
```
Analysis: Iterative approach building reversed string character by character, demonstrating clear string manipulation logic.
Sample 3: Prime Number Check
```python
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, n):
        if n % i == 0:
            return False
    return True
```
Analysis: Trial-division primality test with correct edge-case handling (n ≤ 1); functionally correct, though it checks every divisor up to n-1 rather than stopping at √n.
Sample 4: Binary Search
```python
def binary_search(array, target):
    left = 0
    right = len(array) - 1
    while left <= right:
        middle = (left + right) // 2
        if array[middle] == target:
            return middle
        elif array[middle] < target:
            left = middle + 1
        else:
            right = middle - 1
    return False
```
Analysis: Correctly implements iterative binary search with proper pointer management and termination conditions, returning False when the target is absent.
Sample 5: Longest Common Substring (Advanced)
```python
def longest_common_substring(str1, str2):
    # Create 2D matrix for dynamic programming
    matrix = [[0]*(len(str2) + 1) for _ in range(len(str1) + 1)]
    max_length = 0
    end_idx_str1 = 0
    # Fill matrix using dynamic programming
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i-1] == str2[j-1]:
                matrix[i][j] = matrix[i-1][j-1] + 1
                if matrix[i][j] > max_length:
                    max_length = matrix[i][j]
                    end_idx_str1 = i
            else:
                matrix[i][j] = 0
    # Extract longest common substring
    if max_length == 0:
        return ""
    else:
        start_idx_str1 = end_idx_str1 - max_length
        return str1[start_idx_str1:end_idx_str1]
```
Analysis: Sophisticated implementation using dynamic programming with proper matrix initialization, filling logic, and substring extraction. Demonstrates model's capability with complex algorithms.
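As a quick usage check in the spirit of the execution-based evaluation, the following assertions (assuming the five functions above are defined in the same session) confirm the described behavior, including binary_search returning False when the target is absent:

```python
# Quick functional spot checks for the generated samples above.
assert factorial(5) == 120
assert reverse_string("abc") == "cba"
assert is_prime(13) and not is_prime(1)
assert binary_search([1, 3, 5, 7], 5) == 2
assert binary_search([1, 3, 5, 7], 4) is False
assert longest_common_substring("abcdef", "zcdem") == "cde"
print("All sample functions passed their spot checks.")
```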
Discussion & Analysis
Key Insights
Metric Divergence
Evaluation loss and BLEU scores don't fully capture code correctness. Minor syntactic errors can cause execution failure despite good similarity scores, highlighting the need for execution-based evaluation.
Hyperparameter Trade-offs
Higher rank configurations (r=16, 32) learn patterns faster (lower loss) but may overfit to dataset syntax rather than generalizable code structure, reducing execution success.
Practical Recommendations
For production code generation, a moderate rank (r=4-8) with proportional alpha (α=2r) provides the best balance of learning capability and functional correctness.
Training Duration
Peak execution rates were often achieved early (635-1270 steps). Extended training improves loss metrics but may degrade practical utility, suggesting early stopping based on execution rate.
Detailed Performance Analysis
Configuration r=1, α=2: Despite highest eval loss (0.5500), achieved exceptional 93.34% execution rate in initial sweep and maintained 86.67% at extended training. Suggests minimal parameter adaptation sufficient for functional code generation.
Configuration r=8, α=16: Emerged as optimal balance—90.00% execution rate with reasonable eval loss (0.3946) at step 1270. This configuration provides sufficient expressiveness without overfitting.
Configuration r=32, α=64: Achieved lowest eval loss (0.1895) but execution rate dropped to 56.67%, indicating potential overfitting to training data patterns rather than learning generalizable code structures.
Implications for Code Generation
- Execution Rate as Primary Metric: For code generation tasks, functional correctness (execution success) should be weighted more heavily than lexical similarity metrics
- Early Stopping Strategy: Monitor execution rate on a validation set; stop when it plateaus or begins declining, even if loss continues improving (see the callback sketch after this list)
- Conservative Hyperparameters: Smaller LoRA ranks may be preferable for code generation compared to text generation tasks
- Dataset Quality: Model learns patterns from training data—high-quality, diverse code examples crucial for generalization
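One way to operationalize the early-stopping strategy is a custom transformers TrainerCallback that tracks execution rate after each evaluation pass and halts once it stops improving. The sketch below is illustrative; compute_execution_rate is a user-supplied function (for example, the subprocess harness shown earlier applied to validation prompts), not part of any library.

```python
# Sketch: early stopping on execution rate via a transformers TrainerCallback.
from transformers import TrainerCallback

class ExecutionRateEarlyStopping(TrainerCallback):
    def __init__(self, compute_execution_rate, patience=2):
        self.compute_execution_rate = compute_execution_rate  # user-supplied (assumption)
        self.patience = patience
        self.best = -1.0
        self.bad_evals = 0

    def on_evaluate(self, args, state, control, **kwargs):
        rate = self.compute_execution_rate()
        if rate > self.best:
            self.best, self.bad_evals = rate, 0
        else:
            self.bad_evals += 1
        if self.bad_evals >= self.patience:
            control.should_training_stop = True  # stop even if eval loss keeps falling
        return control
```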
Implementation Details
Training Configuration
- Model: Qwen 2.5 7B
- Dataset: flytech/python-codes-25k
- Training Steps: 625 (1 epoch) to 3175 (5 epochs)
- Batch Size: Adaptive based on GPU memory
- Learning Rate: Adaptive with LoRA α/r scaling
- Optimizer: AdamW
- Target Modules: Query, Key, Value projections in attention layers
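A representative configuration along these lines might look as follows; the batch size, learning rate, and step values are illustrative placeholders rather than the exact settings used in the sweep.

```python
# Sketch of a training configuration (values are illustrative placeholders).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-7b-lora-python",
    per_device_train_batch_size=4,      # adapted to available GPU memory
    gradient_accumulation_steps=4,
    learning_rate=2e-4,                 # adapter updates additionally scale with alpha/r
    optim="adamw_torch",
    num_train_epochs=5,
    eval_strategy="steps",              # named evaluation_strategy in older transformers versions
    eval_steps=635,                     # checkpoint spacing similar to the extended runs
    save_steps=635,
    logging_steps=50,
    bf16=True,
)
```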
LoRA Architecture
For an attention weight matrix W ∈ R^(d×k), LoRA adds a low-rank update:

W' = W + (α/r)·BA

where:
- B ∈ R^(d×r): up-projection matrix (trainable)
- A ∈ R^(r×k): down-projection matrix (trainable)
- r << min(d, k): rank (hyperparameter)

Effective parameter reduction:
- Original: d × k parameters
- LoRA: (d + k) × r parameters
- Ratio: ≈0.4% for r=8, d=4096, k=4096
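Plugging in the dimensions above, a short calculation shows how small the adapter is relative to a single attention projection:

```python
# Parameter count for one d x k attention projection with a rank-r LoRA adapter.
d, k, r = 4096, 4096, 8
original = d * k                  # 16,777,216 frozen weights
lora = (d + k) * r                # 65,536 trainable weights
print(f"LoRA fraction: {100 * lora / original:.2f}%")  # ~0.39% of the projection
```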
Future Work & Extensions
QLoRA Implementation (In Progress)
Incorporate 4-bit/8-bit quantization using the Unsloth library to further reduce the memory footprint and enable fine-tuning on consumer GPUs with limited VRAM.
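A typical 4-bit setup pairs a quantized base model with the same LoRA adapters. The sketch below uses transformers' BitsAndBytesConfig (Unsloth offers a similar, further-optimized path); since this work is in progress, the configuration shown is an illustrative assumption rather than a finalized setup.

```python
# Sketch: loading the base model in 4-bit (NF4) before attaching LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",                 # assumed checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters are then attached exactly as in the full-precision setup.
```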
Multi-Model Testing (Planned)
Extend methodology to Llama 3.2, Mistral, and CodeLlama to assess generalizability of findings across different model architectures.
Advanced Evaluation (Planned)
Implement Pass@k metric, CodeBERTScore, and comprehensive test case execution across diverse problem sets beyond simple function generation.
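For reference, the standard unbiased pass@k estimator from the Codex evaluation methodology is straightforward to implement once n samples per problem and the count c of passing samples are available:

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = generated samples per problem, c = correct samples, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))              # 0.25
print(round(pass_at_k(n=20, c=5, k=5), 3))    # ~0.806
```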
Dataset Expansion (Planned)
Utilize larger datasets (nvidia/OpenCodeInstruct, jtatman/python-code-dataset-500k) for improved coverage of programming paradigms and edge cases.
LoRA+ Investigation (In Progress)
Explore LoRA+ variant with differential learning rates for A and B matrices to potentially improve adaptation efficiency.
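The core idea of LoRA+ is to give the B matrices a larger learning rate than the A matrices. With PEFT-style parameter names (lora_A / lora_B), this can be sketched as explicit optimizer parameter groups; the 16x ratio below is a placeholder, not a tuned value.

```python
# Sketch: LoRA+-style parameter groups with a larger learning rate for lora_B.
# Assumes a PEFT-wrapped model whose adapter parameters contain "lora_A"/"lora_B".
import torch

def build_lora_plus_optimizer(model, base_lr=2e-4, b_lr_ratio=16.0):
    a_params, b_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_B" in name:
            b_params.append(param)
        else:
            a_params.append(param)  # lora_A and any other trainable parameters
    return torch.optim.AdamW([
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * b_lr_ratio},
    ])
```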
RLHF Integration (Research)
Investigate Reinforcement Learning from Human Feedback to align generated code with developer preferences and best practices.
Research Directions
- Automated Hyperparameter Selection: Develop algorithms to automatically select optimal LoRA rank based on task complexity and dataset characteristics
- Multi-Task Fine-Tuning: Explore joint training on multiple programming languages or coding tasks with shared LoRA parameters
- Iterative Refinement: Implement self-correction mechanisms where model evaluates and refines its own generated code
- Domain-Specific Adaptation: Fine-tune for specialized domains (data science, web development, algorithm implementation)
- Error Analysis: Systematic categorization of generation failures to guide targeted improvements
Resources & Links
- LoRA Paper: Hu et al. "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv:2106.09685)
- Qwen 2.5: Official Documentation
- PEFT Library: Hugging Face Parameter-Efficient Fine-Tuning
- Code Generation Datasets: Awesome LLM Datasets
Conclusion
This project successfully demonstrates that Parameter-Efficient Fine-Tuning using LoRA can create highly capable code-generating LLMs with minimal computational resources. Through systematic exploration of hyperparameters, we identified configurations achieving up to 93.34% execution success rate on Python code generation tasks.
The key finding is that functional correctness (execution rate) and evaluation loss metrics can diverge significantly in code generation tasks. Configuration r=8, α=16 emerged as optimal, achieving 90% execution success with good generalization at relatively early training steps (1270 steps).
Our results highlight that smaller LoRA ranks (r=4-8) with proportional alpha scaling often provide the best balance for practical code generation, suggesting that extensive parameter adaptation may not be necessary—and potentially harmful—for achieving functional correctness. These insights provide valuable guidance for developing efficient, resource-conscious code-generating AI systems.
Impact & Applications
- Developer Productivity: Automated code generation for routine programming tasks
- Education: Learning aid for programming students with instant code examples
- Code Completion: Enhanced IDE integration for context-aware suggestions
- Rapid Prototyping: Accelerated development cycles with AI-assisted coding
- Accessible AI: Demonstrating that effective LLM fine-tuning is achievable with limited resources