Policy Improvement Feedback Loop - Complete Explanation
The Feedback Loop Cycle
┌─────────────────────────────────────────────────────────────────┐
│ FEEDBACK LOOP CYCLE │
└─────────────────────────────────────────────────────────────────┘
Step 1: DECISION
Agent decides: Privacy = "private" (weight: 0.65)
↓
Step 2: EXECUTION
Video uploaded as PRIVATE
↓
Step 3: OUTCOME
YouTube metrics: 50 views, 3 likes, user doesn't change privacy
↓
Step 4: FEEDBACK
User gives explicit rating: ***** (5 stars)
OR the agent infers from behavior: user kept it private
↓
Step 5: REWARD SIGNAL
Rating 5/5 → Reward: +1.0
↓
Step 6: POLICY UPDATE
old_weight("private") = 0.65
adjustment = 0.1 * (+1.0) = +0.10
new_weight("private") = 0.75 ← IMPROVED!
↓
Step 7: NEXT DECISION (EPISODE 2)
Agent now MORE likely to use "private" (weight: 0.75)
Loop repeats...
How It Works (Detailed)
Phase 1: Make Decision
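import math
import random

# Agent has learned weights for each action
policy_weights = {
    'privacy': {
        'private': 0.65,   # Most likely
        'unlisted': 0.25,
        'public': 0.10,    # Least likely
    }
}

def softmax(values):
    # Exponentiate each weight, then normalize so the results sum to 1
    exps = [math.exp(v) for v in values]
    return [e / sum(exps) for e in exps]

# Convert to probabilities
actions = list(policy_weights['privacy'])
probs = softmax(list(policy_weights['privacy'].values()))
# Result: approximately [0.44, 0.30, 0.26]

# Sample action
action = random.choices(actions, weights=probs, k=1)[0]
# Result: 'private' with about 44% probability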
What's happening:
- Weights represent what agent learned works
- Softmax converts to probabilities
- Random sampling enables exploration (about a 26% chance to try 'public' under softmax)
- Higher weight = higher probability, but not guaranteed
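The point that a higher weight raises the probability without guaranteeing the choice can be checked empirically. The sketch below is a minimal illustration, not the project's actual code: it draws many samples from the softmax distribution and compares observed frequencies with the computed probabilities. The seed and the sample count are arbitrary assumptions.

```python
import math
import random
from collections import Counter

def softmax(values):
    """Turn raw weights into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

actions = ['private', 'unlisted', 'public']
probs = softmax([0.65, 0.25, 0.10])

rng = random.Random(42)  # fixed seed, only to make the sketch repeatable
draws = rng.choices(actions, weights=probs, k=10_000)
counts = Counter(draws)
frequencies = {a: counts[a] / len(draws) for a in actions}

# Compare the theoretical probability with what sampling actually produced
for a, p in zip(actions, probs):
    print(f"{a:9s} softmax={p:.2f} observed={frequencies[a]:.2f}")
```

Even the lowest-weighted action is sampled regularly, which is what keeps the agent exploring.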
Phase 2: Execute Decision
Agent's decision: PRIVATE
↓
Upload to YouTube with:
- Privacy: private
- Title: [from analysis]
- Description: [from analysis]
↓
Observe outcomes:
- Views: 50
- Likes: 3
- Shares: 0
- Comments: 1
Phase 3: Receive Feedback
Type 1: Explicit Rating
User rates: ***** (5 stars)
Means: "Great decision, I'm happy with PRIVATE"
Feedback value: 5.0
Type 2: Correction
User changes: PRIVATE → PUBLIC
Means: "Wrong decision, should be PUBLIC"
Feedback value: 2 (converted to a -1.0 penalty in Phase 4)
Type 3: Inferred from Behavior
Metrics analysis:
- User kept video PRIVATE (didn't change)
- Shared it with 3 people
- High engagement from those 3
Inference: "Good privacy choice"
Feedback value: +0.7
Phase 4: Convert to Reward
import math

def feedback_to_reward(feedback_type, value):
    """Map raw feedback onto a scalar reward."""
    if feedback_type == 'explicit_rating':
        # 5 stars → +1.0, 3 stars → 0.0, 1 star → -1.0
        return (value - 3) / 2
    elif feedback_type == 'correction':
        # User corrected → -0.5 (value 1) to -1.0 (value 2)
        return -0.5 * value
    elif feedback_type == 'behavior':
        # Engagement metrics
        return math.tanh(value)  # Bounded to (-1, 1)
    raise ValueError(f"unknown feedback type: {feedback_type}")

# Examples:
feedback_to_reward('explicit_rating', 5)  # +1.0   ✓ Perfect!
feedback_to_reward('explicit_rating', 3)  #  0.0   ~ Neutral
feedback_to_reward('explicit_rating', 1)  # -1.0   ✗ Bad!
feedback_to_reward('correction', 1)       # -0.5   ✗ Wrong decision
feedback_to_reward('behavior', 0.8)       # +0.66  ✓ Good engagement
Phase 5: Update Policy
learning_rate = 0.1  # How fast to adapt (10% of reward)
old_weight = 0.65
reward = +1.0
adjustment = learning_rate * reward           # 0.1 * 1.0 = +0.10
new_weight = old_weight + adjustment          # 0.65 + 0.10 = 0.75
new_weight = min(max(new_weight, 0.1), 0.9)   # clip to [0.1, 0.9] → 0.75 ✓
# Result: Agent now MORE confident in "private"
With Feedback:
Episode 1: weight = 0.50 → feedback: 5 stars → weight = 0.60
Episode 2: weight = 0.60 → feedback: 5 stars → weight = 0.70
Episode 3: weight = 0.70 → feedback: 5 stars → weight = 0.80
...learns QUICKLY that "private" is good
Without Feedback:
Episode 1: weight = 0.50 → no feedback → weight = 0.50
Episode 2: weight = 0.50 → no feedback → weight = 0.50
Episode 3: weight = 0.50 → no feedback → weight = 0.50
...no learning, random performance
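The two traces above can be reproduced with a few lines. This is a sketch under the same assumptions as Phase 5 (learning rate 0.1, weights clipped to [0.1, 0.9]); `run_episodes` is a hypothetical helper name, and `None` is used here to stand for an episode with no feedback.

```python
def run_episodes(start_weight, rewards, learning_rate=0.1, low=0.1, high=0.9):
    """Apply one reward per episode and return the weight after each step."""
    weight = start_weight
    history = [weight]
    for reward in rewards:
        if reward is not None:  # None means no feedback arrived this episode
            weight = max(low, min(high, weight + learning_rate * reward))
        history.append(round(weight, 2))
    return history

with_feedback = run_episodes(0.50, [1.0, 1.0, 1.0])
without_feedback = run_episodes(0.50, [None, None, None])
saturated = run_episodes(0.50, [1.0] * 6)  # more 5-star ratings than the clip allows

print(with_feedback)
print(without_feedback)
print(saturated[-1])
```

The last call shows the effect of the clip: repeated 5-star ratings stop raising the weight once it reaches the upper bound.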
Real Example: Privacy Decision Learning
Starting State (No Learning)
Decision Type: PRIVACY
Actions: private, unlisted, public
Initial weights: equal (all 0.33)
Agent has no idea what users prefer
Makes random decisions
High chance of a privacy mismatch (each action is picked about a third of the time)
After 20 Episodes with Feedback
Outcomes observed:
- Users rated "private" videos: avg 4.2/5 ****
- Users rated "public" videos: avg 2.1/5 **
- Users corrected "public" to "private": 5 times
- Users corrected "private" to "public": 1 time
New weights:
private: 0.70 ← Most preferred by users
unlisted: 0.20
public: 0.10 ← Least preferred
Privacy accuracy improved: 50% → 85%
User satisfaction improved: 2.5/5 → 4.0/5
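To connect these averages to the reward scale, the rating formula from Phase 4 can be applied to the average ratings directly. This is only an illustrative calculation; `rating_to_reward` is a hypothetical helper that restates the explicit-rating branch.

```python
def rating_to_reward(avg_rating):
    """Map a 1-5 star average onto [-1, 1]; 3 stars is neutral."""
    return (avg_rating - 3) / 2

# Average ratings observed after 20 episodes (from the list above)
avg_ratings = {'private': 4.2, 'public': 2.1}
avg_rewards = {action: round(rating_to_reward(r), 2)
               for action, r in avg_ratings.items()}

print(avg_rewards)
```

A positive average reward for "private" and a negative one for "public" is what pushes the two weights apart over repeated updates.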
After 50 Episodes with Feedback
Hundreds of data points collected
Clear pattern: Users want privacy by default
Final weights:
private: 0.85 ← Strongly preferred
unlisted: 0.10
public: 0.05 ← Rarely chosen
Privacy accuracy: 95%+ ✓
User satisfaction: 4.5/5 *****
Privacy violations: 0 (last 30 episodes)
The Key Insight: Reward Signal
The reward signal is the bridge between feedback and learning.
User Feedback        Reward Signal       Policy Update
────────────────────────────────────────────────────────
5-star rating    →   +1.0 reward    →   weight +0.10
User correction  →   -1.0 reward    →   weight -0.10
High engagement  →   +0.6 reward    →   weight +0.06
Low engagement   →   -0.3 reward    →   weight -0.03
User deleted     →   -2.0 reward    →   weight -0.20 (hard penalty)
Why this matters:
- Without reward signal = no learning
- Wrong reward signal = wrong learning
- Strong reward signal = fast learning
- Weak reward signal = slow learning
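The weight changes in the table follow from multiplying each reward by the learning rate. A minimal sketch, assuming the learning rate of 0.1 used elsewhere in this write-up:

```python
LEARNING_RATE = 0.1  # same value as in Phase 5

# Reward signals from the table above
rewards = {
    '5-star rating': 1.0,
    'User correction': -1.0,
    'High engagement': 0.6,
    'Low engagement': -0.3,
    'User deleted': -2.0,  # hard penalty, deliberately outside [-1, 1]
}

# Weight change each signal produces before any clipping
deltas = {name: round(LEARNING_RATE * r, 2) for name, r in rewards.items()}

for name, delta in deltas.items():
    print(f"{name:16s} -> weight {delta:+.2f}")
```

This also makes the "strong signal = fast learning" point concrete: doubling the reward magnitude doubles the step.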
Convergence: From Random to Expert
Episode 1-20: Exploration Phase
Agent: "I don't know what works"
Behavior: Tries all options randomly
Accuracy: ~50% (random guessing)
Satisfaction: ~2.5/5 (mixed results)
With feedback:
- Gets ratings on each decision
- Patterns emerge
- Weights start changing
Episode 21-50: Learning Phase
Agent: "I'm noticing patterns"
Behavior: Mostly exploits best option, occasionally explores
Accuracy: ~75% (clear winner identified)
Satisfaction: ~3.8/5 (mostly good decisions)
Feedback impact:
- Each episode refines weights
- Wrong actions quickly penalized
- Good actions reinforced
Episode 51-80: Refinement Phase
Agent: "I'm pretty confident"
Behavior: Usually chooses best option, rare exploration
Accuracy: ~92% (fine-tuning minor edge cases)
Satisfaction: ~4.4/5 (very good decisions)
Feedback impact:
- Marginal improvements
- Edge cases handled
- Policy stabilizing
Episode 81-100: Convergence
Agent: "I've learned the optimal policy"
Behavior: Consistently chooses best option, minimal exploration
Accuracy: >95% (near perfect)
Satisfaction: >4.5/5 (excellent decisions)
Feedback impact:
- Mostly confirms what's learned
- Rare corrections
- Policy stable and ready to deploy
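The progression from exploration to convergence can be simulated with a toy user. Everything about the simulated user below is an assumption made for illustration (it always rewards "private", is neutral on "unlisted", and penalizes "public"), and the `temperature` parameter is not part of the original design. It is included because, with weights clipped to [0.1, 0.9], a plain softmax never gives the best action much more than about half the probability, so some sharpening is needed for the agent to choose it consistently.

```python
import math
import random

def softmax(values, temperature=1.0):
    """Softmax with a temperature; lower temperature means less exploration."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def simulate(episodes=100, learning_rate=0.1, temperature=0.2, seed=0):
    rng = random.Random(seed)
    actions = ['private', 'unlisted', 'public']
    # Hypothetical user preferences, used only to generate rewards
    simulated_reward = {'private': 1.0, 'unlisted': 0.0, 'public': -1.0}
    weights = {a: 0.33 for a in actions}  # start with no preference
    for _ in range(episodes):
        probs = softmax([weights[a] for a in actions], temperature)
        action = rng.choices(actions, weights=probs, k=1)[0]
        updated = weights[action] + learning_rate * simulated_reward[action]
        weights[action] = max(0.1, min(0.9, updated))  # same clip as Phase 5
    return weights

final_weights = simulate()
print(final_weights)
```

Because only the chosen action's weight is updated, "private" can only rise and "public" can only fall in this toy setup; real feedback is noisier, so the phase boundaries above should be read as rough guides.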
When Feedback Helps vs Doesn't Help
Feedback HELPS When:
✓ Consistent pattern in feedback
(Multiple users agree: "private is better")
✓ Strong signal strength
(5-star vs 1-star, not 3-star which is neutral)
✓ Feedback is timely
(Immediate correction, not delayed)
✓ Diverse feedback
(Different user types, contexts, video types)
Feedback DOESN'T HELP When:
✗ Contradictory feedback
(Some users say "private", others say "public")
✗ Weak signal strength
(Mostly neutral 3-star ratings)
✗ Biased feedback
(All feedback from one user type)
✗ Noisy feedback
(User rating changes randomly)
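Why contradictory or neutral feedback fails to help can be seen from the update rule itself: the net weight change is the learning rate times the sum of rewards, so opposing rewards cancel. The batches below are made-up examples, and `net_weight_change` is a hypothetical helper that ignores clipping.

```python
from statistics import mean

def net_weight_change(rewards, learning_rate=0.1):
    """Total unclipped weight change produced by a batch of rewards."""
    return learning_rate * sum(rewards)

# Illustrative reward batches for the same action
batches = {
    'consistent': [1.0, 1.0, 0.5, 1.0],
    'contradictory': [1.0, -1.0, 1.0, -1.0],
    'weak (3-star)': [0.0, 0.0, 0.0, 0.0],
}

summary = {name: (mean(r), round(net_weight_change(r), 2))
           for name, r in batches.items()}

for name, (avg, change) in summary.items():
    print(f"{name:14s} mean reward={avg:+.2f} net change={change:+.2f}")
```

Only the consistent batch moves the weight; the other two leave the policy where it started despite the same number of feedback events.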
Practical Implications
For the Agent
With strong feedback loop:
Training time: 50-100 videos
Final accuracy: 95%+
Deployment confidence: High ✓
Without feedback loop:
Training time: 500+ videos (5-10x longer)
Final accuracy: 70-80%
Deployment confidence: Low ✗
For the User
User who gives feedback:
Episode 1: "Agent made bad decision"
User rates it: * (1 star)
↓
Episode 2: "Same situation"
Agent: NOW chooses differently (learned from feedback)
User: "Much better! 5 stars"
↓
Episode 3: "Similar situation"
Agent: Correct decision automatically
User: "No feedback needed, works great"
Result: Agent learned in 3 episodes via feedback
User who doesn't give feedback:
Episode 1-10: Agent makes random-ish decisions
Episode 11-20: Agent still struggling
Episode 50: Finally learned (about 17x longer than the 3 episodes above)
Result: Weak learning signal = slow improvement
Feedback Loop Metrics
Quantity Metrics
- Episodes with feedback: 30/100 (30%)
- Feedback types: 15 ratings, 8 corrections, 7 inferred
- Total reward signals: 30
Quality Metrics
- Avg feedback strength: 3.2/5 stars (about +0.1 on the -1 to +1 reward scale)
- Feedback consistency: 0.85 (0-1, higher = consistent)
- Feedback-to-improvement ratio: 0.12 (reward improvement per feedback)
Impact Metrics
- Policy updates triggered: 25/30 (83% of feedback → update)
- Weight changes per update: 0.08 (avg adjustment)
- Accuracy improvement per feedback: +2.1 percentage points (62 points total / 30 feedback signals)
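The ratios in these metrics can be derived from the raw counts. A small sketch using only the numbers listed above; the variable names are illustrative.

```python
episodes = 100
feedback_counts = {'ratings': 15, 'corrections': 8, 'inferred': 7}
policy_updates = 25
accuracy_gain_points = 62  # total improvement, in percentage points

total_feedback = sum(feedback_counts.values())
feedback_rate = total_feedback / episodes          # share of episodes with feedback
update_rate = policy_updates / total_feedback      # share of feedback that triggered an update
gain_per_feedback = accuracy_gain_points / total_feedback

print(f"feedback rate: {feedback_rate:.0%}")
print(f"update rate: {update_rate:.0%}")
print(f"gain per feedback: {gain_per_feedback:.1f} points")
```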
Key Takeaways
The Feedback Loop Formula
Reward = f(feedback_type, value)
ΔWeight = learning_rate × Reward
NewPolicy = OldPolicy + ΔWeight
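Put together, the three lines of the formula amount to one small function. The sketch below is one possible reading rather than a definitive implementation: it reuses the reward mapping from Phase 4 and the [0.1, 0.9] clip from Phase 5, and `policy_step` is a hypothetical name.

```python
import math

def feedback_to_reward(feedback_type, value):
    """Reward = f(feedback_type, value), as defined in Phase 4."""
    if feedback_type == 'explicit_rating':
        return (value - 3) / 2
    if feedback_type == 'correction':
        return -0.5 * value
    if feedback_type == 'behavior':
        return math.tanh(value)
    raise ValueError(f"unknown feedback type: {feedback_type}")

def policy_step(policy, action, feedback_type, value, learning_rate=0.1):
    """Return a new policy with the chosen action's weight adjusted."""
    reward = feedback_to_reward(feedback_type, value)
    delta = learning_rate * reward                  # ΔWeight = learning_rate × Reward
    new_policy = dict(policy)                       # leave the old policy untouched
    new_policy[action] = max(0.1, min(0.9, policy[action] + delta))
    return new_policy

old_policy = {'private': 0.65, 'unlisted': 0.25, 'public': 0.10}
new_policy = policy_step(old_policy, 'private', 'explicit_rating', 5)
print(new_policy)
```

Only the weight of the action that was actually taken changes; the other actions keep their weights until they are chosen and rated.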
Three Critical Ingredients
- Feedback Source (User ratings, corrections, behavior)
- Reward Signal (Translation to scalar value)
- Policy Update (Weight adjustment based on reward)
Without ANY ONE of these, learning stops.
No Feedback + Good Reward Function + Good Policy Update = NO LEARNING ✗
Good Feedback + No Reward Function + Good Policy Update = NO LEARNING ✗
Good Feedback + Good Reward Function + No Policy Update = NO LEARNING ✗
Good Feedback + Good Reward Function + Good Policy Update = LEARNING ✓
Optimization: Making Feedback Loops Work Better
Strategy 1: Active Feedback Solicitation
Don't wait for user feedback
Actively ask: "Was this decision helpful?"
Result: More feedback → faster learning
Strategy 2: Diverse Feedback
Collect different feedback types:
- Explicit ratings (strongest signal)
- Corrections (immediate feedback)
- Behavior inference (continuous signal)
Result: Richer learning signal
Strategy 3: Reward Tuning
Adjust reward weights:
- Privacy violations: -2.0 (hard constraint)
- Good ratings: +1.0 (strong reward)
- Corrections: -0.5 (learning signal)
Result: Better guidance to policy updates
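One way to keep these tuned values in a single place is a lookup table that the update step reads from. This is a sketch with hypothetical event names; the reward values are the ones listed above, and the clip bounds are carried over from Phase 5.

```python
# Tunable reward per event type (values from the list above)
REWARD_TABLE = {
    'privacy_violation': -2.0,  # hard constraint
    'good_rating': 1.0,         # strong reward
    'correction': -0.5,         # learning signal
}

def apply_event(weight, event, learning_rate=0.1, low=0.1, high=0.9):
    """Update a weight using the tuned reward for this event."""
    updated = weight + learning_rate * REWARD_TABLE[event]
    return max(low, min(high, updated))

weight = 0.5
for event in ['good_rating', 'privacy_violation']:
    weight = apply_event(weight, event)
    print(event, round(weight, 2))
```

With these values a single privacy violation undoes two good ratings, which is the asymmetry the hard constraint is meant to produce.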
Strategy 4: Learning Rate Adaptation
learning_rate = 0.1 initially
→ 0.15 when feedback is strong
→ 0.05 when converged (avoid oscillation)
Result: Fast learning + stable convergence
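A possible way to express this schedule in code is shown below. The three rates come from the strategy above; the thresholds that decide when feedback counts as "strong" or the policy as "converged" are assumptions chosen for illustration, as is the function name.

```python
def adaptive_learning_rate(reward, recent_changes,
                           base=0.1, strong=0.15, converged=0.05,
                           strong_threshold=0.8, converged_threshold=0.01):
    """Pick a learning rate from the current reward and recent weight changes."""
    # Converged: the last few updates barely moved the weight (assumed threshold)
    if recent_changes and max(abs(c) for c in recent_changes) < converged_threshold:
        return converged
    # Strong feedback: reward magnitude close to the extremes (assumed threshold)
    if abs(reward) >= strong_threshold:
        return strong
    return base

print(adaptive_learning_rate(1.0, [0.10, 0.10]))    # strong feedback, still learning
print(adaptive_learning_rate(0.3, [0.05, 0.03]))    # moderate feedback
print(adaptive_learning_rate(1.0, [0.001, 0.002]))  # weights have settled
```

Checking convergence first means a settled policy keeps the small rate even when an occasional strong rating arrives, which is what limits oscillation.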
Feedback is the fuel that drives the learning loop.