Abstract
Motivation & Problem Statement
The advent of large language models has revolutionized code generation, promising to enhance developer productivity dramatically. However, fine-tuning large pre-trained LLMs for specific tasks often demands substantial computational resources.
The Challenge
- Resource Constraints: Full fine-tuning of 7B+ parameter models requires significant GPU memory and compute time
- Task Specialization: General-purpose LLMs need adaptation for domain-specific code generation
- Quality vs. Efficiency: Balancing model performance with computational efficiency
- Evaluation Complexity: Code correctness requires more than lexical similarity metrics
Our Solution: LoRA
Parameter-Efficient Fine-Tuning (PEFT) methods, specifically LoRA (Low-Rank Adaptation), offer a viable solution by:
- Injecting trainable rank-decomposition matrices into the transformer architecture (a minimal configuration sketch follows this list)
- Significantly reducing trainable parameters (typically 0.1-1% of the original model's weights)
- Maintaining or improving performance compared to full fine-tuning
- Enabling fine-tuning on consumer-grade hardware
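As a concrete illustration of how such adapters are attached in practice, here is a minimal sketch using the Hugging Face PEFT library; the checkpoint name and target-module list are illustrative assumptions, not the project's exact training script.

```python
# Minimal sketch: attaching LoRA adapters with Hugging Face PEFT (illustrative).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # assumed checkpoint name

lora_config = LoraConfig(
    r=8,                                             # rank of the low-rank update
    lora_alpha=16,                                   # update is scaled by alpha/r
    target_modules=["q_proj", "k_proj", "v_proj"],   # attention projections (assumed names)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the 7B base
```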
Key Contributions
Systematic Hyperparameter Search
Comprehensive exploration of LoRA rank (r) and alpha (α) parameters to identify optimal configurations for Python code generation.
Multi-Metric Evaluation
Holistic assessment using evaluation loss, BLEU score, similarity metrics, and critically, execution success rate for functional correctness.
Extended Training Analysis
Deep investigation of promising configurations across multiple epochs (up to 3175 steps) to understand learning dynamics.
Practical Code Generation
Demonstrated capability to generate syntactically correct and functionally accurate Python code for common programming tasks.
Methodology
Model Architecture
Base Model
Qwen 2.5 7B - Selected for strong general-purpose capabilities and open-source availability, providing a solid foundation for specialized code generation.
Fine-Tuning Method
LoRA - Low-Rank Adaptation injects trainable low-rank matrices into the attention layers, reducing the trainable parameter count from 7B to roughly 10-50M weights.
LoRA Hyperparameters
Rank (r)
Dimensionality of the low-rank matrices. A higher rank means more trainable parameters and potentially greater expressiveness, at increased computational cost. Tested values: {1, 2, 4, 8, 16, 32}
Alpha (α)
Scaling factor for the LoRA update; the update is scaled by α/r, so the effective learning rate of the adapter is proportional to this ratio. Tested values: {1, 2, 4, 8, 16, 32, 64}
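To make the interplay of r and α concrete, the toy layer below (plain PyTorch, illustrative only, not the project's implementation) applies the standard α/r scaling to the low-rank update, so doubling α at a fixed r doubles the magnitude of the adaptation.

```python
# Toy LoRA-style linear layer illustrating the alpha/r scaling (illustrative only).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight W (random here for the sketch)
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # up-projection
        self.scaling = alpha / r                                 # scale of the update

    def forward(self, x):
        # y = x W^T + (alpha/r) * (x A^T) B^T
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(d_in=4096, d_out=4096, r=8, alpha=16)
print(layer.scaling)  # 2.0, i.e. the alpha/r = 2 ratio used throughout the sweep
```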
Dataset
flytech/python-codes-25k: A curated dataset of Python code snippets enabling the model to learn syntax, structure, and common programming patterns specific to Python.
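For reference, loading this corpus with the Hugging Face datasets library looks roughly like the sketch below; the split and column names are not specified here and should be verified against the dataset card.

```python
# Sketch: loading the training corpus (split and column names are assumptions).
from datasets import load_dataset

ds = load_dataset("flytech/python-codes-25k", split="train")
print(ds)               # inspect size and available columns
print(ds[0].keys())     # check the field layout before building prompts
```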
Evaluation Metrics
| Metric | Description | Purpose |
|---|---|---|
| Evaluation Loss | Model prediction quality on unseen data | Generalization capability |
| BLEU Score | N-gram overlap with the reference (Bilingual Evaluation Understudy) | Syntactic similarity to reference |
| Similarity Score | Semantic similarity using embeddings | Conceptual alignment |
| Execution Success Rate | Percentage of executable code | Functional correctness (CRITICAL) |
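Execution success rate can be approximated by running each generated snippet in a separate subprocess and counting clean exits. The sketch below is a simplified, unsandboxed illustration of that idea, not the project's exact evaluation code; a real harness should add sandboxing, resource limits, and temp-file cleanup.

```python
# Simplified sketch of an execution-success-rate check (no sandboxing).
import subprocess
import sys
import tempfile

def execution_success_rate(snippets, timeout=5):
    successes = 0
    for code in snippets:
        # Write each candidate to a temporary file and run it in a fresh interpreter.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            successes += (result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # count hangs as failures
    return 100.0 * successes / len(snippets)

print(execution_success_rate(["print('ok')", "def broken(:\n    pass"]))  # 50.0
```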
Experimental Results
Initial Hyperparameter Search (1 Epoch, 625 Steps)
Systematic exploration of 30+ LoRA configurations revealed important trade-offs between evaluation metrics and functional correctness:
| Rank (r) | Alpha (α) | α/r Ratio | Eval Loss | Avg BLEU | Avg Similarity |
|---|---|---|---|---|---|
| 1 | 2 | 2 | 0.5500 | 29.82 | 36.12 |
| 2 | 4 | 2 | 0.5374 | 28.96 | 36.03 |
| 4 | 8 | 2 | 0.5209 | 28.38 | 34.86 |
| 8 | 16 | 2 | 0.5035 | 28.12 | 34.85 |
| 16 | 32 | 2 | 0.4883 | 32.62 | 38.04 |
| 32 | 32 | 1 | 0.4798 | 32.88 | 40.15 |
| 32 | 64 | 2 | 0.4740 | 31.36 | 36.84 |
Key Findings from Initial Search
- Best Loss: r=32, α=32
- Best Execution: r=1, α=2
- Balanced: r=16, α=32
Extended Training Results (Up to 3175 Steps)
Selected configurations were trained for multiple epochs to analyze learning dynamics and identify optimal stopping points:
| Rank | Alpha | Step | Eval Loss | Train Loss | Execution Rate |
|---|---|---|---|---|---|
| 1 | 2 | 2500 | 0.4728 | 0.3873 | 86.67% |
| 8 | 16 | 1270 | 0.3946 | 0.3406 | 90.00% |
| 4 | 16 | 1905 | 0.4446 | 0.4212 | 83.33% |
| 16 | 32 | 1270 | 0.3527 | 0.2813 | 86.67% |
| 32 | 64 | 635 | 0.4404 | 0.4250 | 86.67% |
| 32 | 64 | 2500 | 0.1895 | 0.1018 | 56.67% |
Critical Observations
Trade-off Discovered
Lower evaluation loss doesn't guarantee higher execution success. Config r=32, α=64 achieved lowest eval loss (0.1895) but only 56.67% execution rate.
Optimal Configuration
r=8, α=16 at step 1270 achieved highest execution success (90.00%) with good eval loss (0.3946), suggesting sweet spot for practical code generation.
Early Stopping Benefits
Many configurations reached peak execution rates early in training (steps 635-1270), suggesting excessive training may hurt functional correctness despite improving loss metrics.
Rank Impact
Smaller ranks (r=1, 2, 4) maintained high execution rates but had higher eval loss. Larger ranks (r=16, 32) achieved lower loss but execution rates varied significantly.
Generated Code Samples
The following samples demonstrate the model's capability to generate syntactically correct and functionally accurate Python code:
Sample 1: Factorial Function
```python
def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return (n * factorial(n-1))
```
Analysis: Correctly implements recursive factorial with proper base cases for 0 and 1.
Sample 2: Reverse String Function
```python
def reverse_string(string):
    reversed_str = ""
    for char in string:
        reversed_str = char + reversed_str
    return reversed_str
```
Analysis: Iterative approach building reversed string character by character, demonstrating clear string manipulation logic.
Sample 3: Prime Number Check
```python
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, n):
        if n % i == 0:
            return False
    return True
```
Analysis: Trial-division primality test with correct edge-case handling (n ≤ 1); functionally correct, though it checks every divisor up to n-1 rather than stopping at √n.
Sample 4: Binary Search
```python
def binary_search(array, target):
    left = 0
    right = len(array) - 1
    while left <= right:
        middle = (left + right) // 2
        if array[middle] == target:
            return middle
        elif array[middle] < target:
            left = middle + 1
        else:
            right = middle - 1
    return False
```
Analysis: Correctly implements iterative binary search with proper pointer management and termination conditions, returning False when the target is absent.
Sample 5: Longest Common Substring (Advanced)
```python
def longest_common_substring(str1, str2):
    # Create 2D matrix for dynamic programming
    matrix = [[0]*(len(str2) + 1) for _ in range(len(str1) + 1)]
    max_length = 0
    end_idx_str1 = 0
    # Fill matrix using dynamic programming
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i-1] == str2[j-1]:
                matrix[i][j] = matrix[i-1][j-1] + 1
                if matrix[i][j] > max_length:
                    max_length = matrix[i][j]
                    end_idx_str1 = i
            else:
                matrix[i][j] = 0
    # Extract longest common substring
    if max_length == 0:
        return ""
    else:
        start_idx_str1 = end_idx_str1 - max_length
        return str1[start_idx_str1:end_idx_str1]
```
Analysis: Sophisticated implementation using dynamic programming with proper matrix initialization, filling logic, and substring extraction. Demonstrates model's capability with complex algorithms.
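As a quick usage check in the spirit of the execution-based evaluation, the following assertions (assuming the five functions above are defined in the same session) confirm the described behavior, including binary_search returning False when the target is absent:

```python
# Quick functional spot checks for the generated samples above.
assert factorial(5) == 120
assert reverse_string("abc") == "cba"
assert is_prime(13) and not is_prime(1)
assert binary_search([1, 3, 5, 7], 5) == 2
assert binary_search([1, 3, 5, 7], 4) is False
assert longest_common_substring("abcdef", "zcdem") == "cde"
print("All sample functions passed their spot checks.")
```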
Discussion & Analysis
Key Insights
Metric Divergence
Evaluation loss and BLEU scores don't fully capture code correctness. Minor syntactic errors can cause execution failure despite good similarity scores, highlighting the need for execution-based evaluation.
Hyperparameter Trade-offs
Higher rank configurations (r=16, 32) learn patterns faster (lower loss) but may overfit to dataset syntax rather than generalizable code structure, reducing execution success.
Practical Recommendations
For production code generation, a moderate rank (r=4-8) with proportional alpha (α=2r) provides the best balance of learning capability and functional correctness.
Training Duration
Peak execution rates were often achieved early (635-1270 steps). Extended training improves loss metrics but may degrade practical utility, suggesting early stopping based on execution rate.
Detailed Performance Analysis
Configuration r=1, α=2: Despite highest eval loss (0.5500), achieved exceptional 93.34% execution rate in initial sweep and maintained 86.67% at extended training. Suggests minimal parameter adaptation sufficient for functional code generation.
Configuration r=8, α=16: Emerged as optimal balance—90.00% execution rate with reasonable eval loss (0.3946) at step 1270. This configuration provides sufficient expressiveness without overfitting.
Configuration r=32, α=64: Achieved lowest eval loss (0.1895) but execution rate dropped to 56.67%, indicating potential overfitting to training data patterns rather than learning generalizable code structures.
Implications for Code Generation
- Execution Rate as Primary Metric: For code generation tasks, functional correctness (execution success) should be weighted more heavily than lexical similarity metrics
- Early Stopping Strategy: Monitor execution rate on a validation set; stop when it plateaus or begins declining, even if loss continues improving (see the callback sketch after this list)
- Conservative Hyperparameters: Smaller LoRA ranks may be preferable for code generation compared to text generation tasks
- Dataset Quality: Model learns patterns from training data—high-quality, diverse code examples crucial for generalization
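One way to operationalize the early-stopping strategy is a custom transformers TrainerCallback that tracks execution rate after each evaluation pass and halts once it stops improving. The sketch below is illustrative; compute_execution_rate is a user-supplied function (for example, the subprocess harness shown earlier applied to validation prompts), not part of any library.

```python
# Sketch: early stopping on execution rate via a transformers TrainerCallback.
from transformers import TrainerCallback

class ExecutionRateEarlyStopping(TrainerCallback):
    def __init__(self, compute_execution_rate, patience=2):
        self.compute_execution_rate = compute_execution_rate  # user-supplied (assumption)
        self.patience = patience
        self.best = -1.0
        self.bad_evals = 0

    def on_evaluate(self, args, state, control, **kwargs):
        rate = self.compute_execution_rate()
        if rate > self.best:
            self.best, self.bad_evals = rate, 0
        else:
            self.bad_evals += 1
        if self.bad_evals >= self.patience:
            control.should_training_stop = True  # stop even if eval loss keeps falling
        return control
```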
Implementation Details
Training Configuration
- Model: Qwen 2.5 7B
- Dataset: flytech/python-codes-25k
- Training Steps: 625 (1 epoch) to 3175 (5 epochs)
- Batch Size: Adaptive based on GPU memory
- Learning Rate: Adaptive with LoRA α/r scaling
- Optimizer: AdamW
- Target Modules: Query, Key, Value projections in attention layers
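A representative configuration along these lines might look as follows; the batch size, learning rate, and step values are illustrative placeholders rather than the exact settings used in the sweep.

```python
# Sketch of a training configuration (values are illustrative placeholders).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-7b-lora-python",
    per_device_train_batch_size=4,      # adapted to available GPU memory
    gradient_accumulation_steps=4,
    learning_rate=2e-4,                 # adapter updates additionally scale with alpha/r
    optim="adamw_torch",
    num_train_epochs=5,
    eval_strategy="steps",              # named evaluation_strategy in older transformers versions
    eval_steps=635,                     # checkpoint spacing similar to the extended runs
    save_steps=635,
    logging_steps=50,
    bf16=True,
)
```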
LoRA Architecture
For an attention weight matrix W ∈ R^(d×k), LoRA adds a low-rank update:

W' = W + (α/r)·BA

where:
- B ∈ R^(d×r): up-projection matrix (trainable)
- A ∈ R^(r×k): down-projection matrix (trainable)
- r << min(d, k): rank (hyperparameter)

Effective parameter reduction:
- Original: d × k parameters
- LoRA: (d + k) × r parameters
- Ratio: ≈0.4% for r=8, d=4096, k=4096
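Plugging in the dimensions above, a short calculation shows how small the adapter is relative to a single attention projection:

```python
# Parameter count for one d x k attention projection with a rank-r LoRA adapter.
d, k, r = 4096, 4096, 8
original = d * k                  # 16,777,216 frozen weights
lora = (d + k) * r                # 65,536 trainable weights
print(f"LoRA fraction: {100 * lora / original:.2f}%")  # ~0.39% of the projection
```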
Future Work & Extensions
QLoRA Implementation (In Progress)
Incorporate 4-bit/8-bit quantization using the Unsloth library to further reduce the memory footprint and enable fine-tuning on consumer GPUs with limited VRAM.
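A typical 4-bit setup pairs a quantized base model with the same LoRA adapters. The sketch below uses transformers' BitsAndBytesConfig (Unsloth offers a similar, further-optimized path); since this work is in progress, the configuration shown is an illustrative assumption rather than a finalized setup.

```python
# Sketch: loading the base model in 4-bit (NF4) before attaching LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",                 # assumed checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters are then attached exactly as in the full-precision setup.
```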
Multi-Model Testing (Planned)
Extend methodology to Llama 3.2, Mistral, and CodeLlama to assess generalizability of findings across different model architectures.
Advanced Evaluation (Planned)
Implement Pass@k metric, CodeBERTScore, and comprehensive test case execution across diverse problem sets beyond simple function generation.
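For reference, the standard unbiased pass@k estimator from the Codex evaluation methodology is straightforward to implement once n samples per problem and the count c of passing samples are available:

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = generated samples per problem, c = correct samples, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))              # 0.25
print(round(pass_at_k(n=20, c=5, k=5), 3))    # ~0.806
```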
Dataset Expansion (Planned)
Utilize larger datasets (nvidia/OpenCodeInstruct, jtatman/python-code-dataset-500k) for improved coverage of programming paradigms and edge cases.
LoRA+ Investigation (In Progress)
Explore LoRA+ variant with differential learning rates for A and B matrices to potentially improve adaptation efficiency.
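The core idea of LoRA+ is to give the B matrices a larger learning rate than the A matrices. With PEFT-style parameter names (lora_A / lora_B), this can be sketched as explicit optimizer parameter groups; the 16x ratio below is a placeholder, not a tuned value.

```python
# Sketch: LoRA+-style parameter groups with a larger learning rate for lora_B.
# Assumes a PEFT-wrapped model whose adapter parameters contain "lora_A"/"lora_B".
import torch

def build_lora_plus_optimizer(model, base_lr=2e-4, b_lr_ratio=16.0):
    a_params, b_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_B" in name:
            b_params.append(param)
        else:
            a_params.append(param)  # lora_A and any other trainable parameters
    return torch.optim.AdamW([
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * b_lr_ratio},
    ])
```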
RLHF Integration (Research)
Investigate Reinforcement Learning from Human Feedback to align generated code with developer preferences and best practices.
Research Directions
- Automated Hyperparameter Selection: Develop algorithms to automatically select optimal LoRA rank based on task complexity and dataset characteristics
- Multi-Task Fine-Tuning: Explore joint training on multiple programming languages or coding tasks with shared LoRA parameters
- Iterative Refinement: Implement self-correction mechanisms where model evaluates and refines its own generated code
- Domain-Specific Adaptation: Fine-tune for specialized domains (data science, web development, algorithm implementation)
- Error Analysis: Systematic categorization of generation failures to guide targeted improvements
Resources & Links
- LoRA Paper: Hu et al. "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv:2106.09685)
- Qwen 2.5: Official Documentation
- PEFT Library: Hugging Face Parameter-Efficient Fine-Tuning
- Code Generation Datasets: Awesome LLM Datasets
Conclusion
This project successfully demonstrates that Parameter-Efficient Fine-Tuning using LoRA can create highly capable code-generating LLMs with minimal computational resources. Through systematic exploration of hyperparameters, we identified configurations achieving up to 93.34% execution success rate on Python code generation tasks.
The key finding is that functional correctness (execution rate) and evaluation loss metrics can diverge significantly in code generation tasks. Configuration r=8, α=16 emerged as optimal, achieving 90% execution success with good generalization at relatively early training steps (1270 steps).
Our results highlight that smaller LoRA ranks (r=4-8) with proportional alpha scaling often provide the best balance for practical code generation, suggesting that extensive parameter adaptation may not be necessary—and potentially harmful—for achieving functional correctness. These insights provide valuable guidance for developing efficient, resource-conscious code-generating AI systems.
Impact & Applications
- Developer Productivity: Automated code generation for routine programming tasks
- Education: Learning aid for programming students with instant code examples
- Code Completion: Enhanced IDE integration for context-aware suggestions
- Rapid Prototyping: Accelerated development cycles with AI-assisted coding
- Accessible AI: Demonstrating that effective LLM fine-tuning is achievable with limited resources