Extending Kontext for Multiple Input Images and Scene Consistency

Project Overview

This project aims to extend Kontext to support multiple input images while maintaining scene consistency across generated outputs. The primary goal is to explore the model's capability to handle temporal consistency and generate coherent sequential shots.

Current Status

Work in Progress - No successful complete runs yet. Seeking feedback from the Kontext paper authors or community.

Technical Approach

Attempted Methods

1. Time Offset Encoding (3D RoPE)

Approach: Used a time offset in 3D RoPE to encode multiple context images, following the suggestion in the original paper.
Results: Produced unexpected artifacts
Documentation: WandB Run
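
To make the time offset idea concrete, here is a minimal sketch of how the (t, h, w) position IDs could be laid out. It is an illustration, not the code from the runs above: the function names are hypothetical, the choice of t = i + 1 for the i-th context image is an assumption, and plain Python tuples stand in for the tensors a real pipeline would use.

```python
def grid_ids(t, height, width):
    """Return (t, h, w) position IDs for one latent token grid, row-major."""
    return [(t, h, w) for h in range(height) for w in range(width)]


def time_offset_ids(num_context, height, width):
    """Target tokens sit at t=0; context image i (0-based) sits at t=i+1.

    The per-image offset of i+1 is an assumption made for this sketch.
    """
    ids = grid_ids(0, height, width)
    for i in range(num_context):
        ids += grid_ids(i + 1, height, width)
    return ids


time_ids = time_offset_ids(num_context=2, height=2, width=3)
print(len(time_ids))                        # 18 tokens: 3 grids of 2x3
print(sorted({t for t, _, _ in time_ids}))  # [0, 1, 2]
```

With this layout every context image shares the same (h, w) coordinates as the target and differs only along the time axis.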

2. Spatial Separation (3D RoPE)

Approach: Maintained all context images at t=1 while separating them spatially in 3D RoPE
Results: Improved performance compared to the time offset approach
Documentation: WandB Run
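
The spatial separation layout can be sketched the same way. This is a hedged illustration rather than the code from the runs: the function names are hypothetical, and placing the context images side by side along the width axis is an assumption (the separation could equally be along height or both axes).

```python
def grid_ids(t, height, width, w_offset=0):
    """Return (t, h, w) position IDs for one grid, shifted along the width axis."""
    return [(t, h, w + w_offset) for h in range(height) for w in range(width)]


def spatial_separation_ids(num_context, height, width):
    """Target tokens at t=0; every context image at t=1, tiled along width.

    Tiling along the width axis is an assumption made for this sketch.
    """
    ids = grid_ids(0, height, width)
    for i in range(num_context):
        ids += grid_ids(1, height, width, w_offset=i * width)
    return ids


spatial_ids = spatial_separation_ids(num_context=2, height=2, width=3)
print(len(spatial_ids))                       # 18 tokens: 3 grids of 2x3
print(len(set(spatial_ids)))                  # 18, so no two tokens share a position
print(max(w for _, _, w in spatial_ids) + 1)  # 6: context spans twice the width
```

Here all context tokens share t=1 and are distinguished only by their spatial coordinates, so no position is reused across images.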

Key Challenges

Quality Degradation with Context Length

The most significant issue observed is the degradation of generated image quality as context length increases. This occurs even in the base model before any training modifications.
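
For a sense of scale, the sketch below estimates how the token sequence grows with the number of context images. It does not explain the degradation; it only quantifies the context length involved. The 8x VAE downsampling, 2x2 patching, 512 text tokens, and 1024x1024 resolution are assumptions chosen for illustration, and the function names are hypothetical.

```python
def image_tokens(height_px, width_px, vae_factor=8, patch=2):
    """Tokens for one image, assuming 8x VAE downsampling and 2x2 patches."""
    stride = vae_factor * patch
    if height_px % stride or width_px % stride:
        raise ValueError(f"image size must be a multiple of {stride}")
    return (height_px // stride) * (width_px // stride)


def sequence_length(num_context, height_px, width_px, text_tokens=512):
    """Text tokens plus one target grid plus one grid per context image."""
    per_image = image_tokens(height_px, width_px)
    return text_tokens + (1 + num_context) * per_image


for n in range(4):
    seq = sequence_length(n, 1024, 1024)
    # Self-attention cost scales with the square of the sequence length.
    print(f"context images={n}  tokens={seq}  attention pairs={seq ** 2}")
```

Under these assumptions each additional 1024x1024 context image adds 4096 tokens, so three context images more than triple the sequence length of the no-context case.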

Open Questions for Discussion

Model Capacity: Is the quality degradation a fundamental model capacity limitation, or can it be addressed through training?
Training Strategy: Would fine-tuning specifically for longer context lengths resolve the quality issues?
Prompt Engineering: Could the text prompting approach be contributing to the degradation problem?
Dataset Requirements: What is the minimum dataset size needed to effectively train the model for:
- Next shot generation
- Temporal consistency maintenance
- Multi-image context handling

Seeking Feedback

Looking for insights from the Kontext paper authors on:

Best practices for multi-image context encoding
Strategies to maintain quality at longer context lengths
Dataset size and composition recommendations for this specific task
Alternative approaches to achieving scene consistency

Next Steps

Continue experimenting with different RoPE configurations
Investigate prompt optimization strategies
Gather training data for fine-tuning experiments
Document findings systematically for community benefit

This project represents my attempt to understand and extend your work. Win or lose, the real value for me is in the learning process and any wisdom you're willing to share.

Built With

b200
huggingface
python
pytorch

Updates

Private user started this project — Sep 24, 2025 07:49 PM EDT
