Extending Kontext for Multiple Input Images and Scene Consistency
Project Overview
This project aims to extend Kontext to support multiple input images while maintaining scene consistency across generated outputs. The primary goal is to explore the model's capability to handle temporal consistency and generate coherent sequential shots.
Current Status
Work in Progress - No successful complete runs yet. Seeking feedback from the Kontext paper authors or community.
Technical Approach
Attempted Methods
1. Time Offset Encoding (3D RoPE)
- Approach: Gave each context image its own time offset in the 3D RoPE position IDs, following the suggestion in the original paper
- Results: Produced unexpected artifacts
- Documentation: WandB Run
2. Spatial Separation (3D RoPE)
- Approach: Maintained all context images at t=1 while separating them spatially in 3D RoPE
- Results: Better output quality than the time offset approach
- Documentation: WandB Run
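The two position-encoding schemes above can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual code: the helper names (`grid_ids`, `time_offset_ids`, `spatial_offset_ids`), the (t, y, x) ordering, the choice to place target tokens at t=0, and the choice to separate context images by shifting them along the width axis are all assumptions made for the example.

```python
from typing import List, Tuple

PosId = Tuple[int, int, int]  # (t, y, x) index fed to a 3D RoPE (assumed ordering)


def grid_ids(h: int, w: int, t: int = 0, y0: int = 0, x0: int = 0) -> List[PosId]:
    """Position IDs for one h x w token grid at time index t, shifted by (y0, x0)."""
    return [(t, y0 + y, x0 + x) for y in range(h) for x in range(w)]


def time_offset_ids(n_context: int, h: int, w: int) -> List[PosId]:
    """Method 1 (hypothetical layout): target at t=0, context image i at t=i+1."""
    ids = grid_ids(h, w, t=0)  # target image tokens
    for i in range(n_context):
        ids += grid_ids(h, w, t=i + 1)  # each context image gets its own time index
    return ids


def spatial_offset_ids(n_context: int, h: int, w: int) -> List[PosId]:
    """Method 2 (hypothetical layout): all context at t=1, shifted along the x axis."""
    ids = grid_ids(h, w, t=0)  # target image tokens
    for i in range(n_context):
        ids += grid_ids(h, w, t=1, x0=i * w)  # same time index, disjoint x ranges
    return ids


if __name__ == "__main__":
    # Two context images on a tiny 2 x 3 token grid.
    print(sorted({t for t, _, _ in time_offset_ids(2, 2, 3)}))
    print(sorted({t for t, _, _ in spatial_offset_ids(2, 2, 3)}))
```

In both layouts every token receives a unique (t, y, x) triple; the difference is only in which axis carries the distinction between context images.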
Key Challenges
Quality Degradation with Context Length
The most significant issue observed is the degradation of generated image quality as context length increases. This occurs even in the base model before any training modifications.
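To make "context length" concrete, the rough sketch below counts how many tokens the transformer attends over as context images are added. The VAE downsampling factor of 8 and the 2 x 2 patch size are assumed values typical of latent diffusion transformers, not figures confirmed for this project, and the function names are hypothetical.

```python
def tokens_per_image(height: int, width: int, vae_factor: int = 8, patch: int = 2) -> int:
    """Tokens for one image, assuming VAE downsampling followed by patchification."""
    step = vae_factor * patch  # pixels covered by one token along each axis
    return (height // step) * (width // step)


def sequence_length(n_context: int, height: int, width: int) -> int:
    """Image tokens in the sequence: one target plus n_context context images."""
    return (1 + n_context) * tokens_per_image(height, width)


if __name__ == "__main__":
    for n in range(4):
        print(n, sequence_length(n, 1024, 1024))
```

Under these assumptions each additional 1024 x 1024 context image adds 4096 tokens, so the sequence grows linearly with the number of context images while the attention cost grows quadratically.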
Open Questions for Discussion
Model Capacity: Is the quality degradation a fundamental model capacity limitation, or can it be addressed through training?
Training Strategy: Would fine-tuning specifically for longer context lengths resolve the quality issues?
Prompt Engineering: Could the text prompting approach be contributing to the degradation problem?
Dataset Requirements: What is the minimum dataset size needed to effectively train the model for:
- Next shot generation
- Temporal consistency maintenance
- Multi-image context handling
Seeking Feedback
Looking for insights from the Kontext paper authors or the community on:
- Best practices for multi-image context encoding
- Strategies to maintain quality at longer context lengths
- Dataset size and composition recommendations for this specific task
- Alternative approaches to achieving scene consistency
Next Steps
- Continue experimenting with different RoPE configurations
- Investigate prompt optimization strategies
- Gather training data for fine-tuning experiments
- Document findings systematically for community benefit
This project represents my attempt to understand and extend your work. Win or lose, the real value for me is in the learning process and any wisdom you're willing to share.
Built With
- b200
- huggingface
- python
- pytorch