Project Category:
Computer Vision and/or Art
Project Idea:
Urbanize explores how AI interprets—and generates—urban environments through human-perceived attributes like wealth and liveliness. Inspired by Deep Learning the City and the Place Pulse 2.0 dataset, we build a Dual-Conditional GAN that learns not only to generate photorealistic street scenes but to systematically manipulate socioeconomic and activity-based attributes.
Instead of training a GAN on raw images alone, we condition both the Generator and the Discriminator on two human-perception labels: wealth and liveliness.
The Discriminator (the "Judge") performs two tasks simultaneously: realism (is the image real or generated?) and attribute accuracy (does the generated image match the intended wealth and liveliness levels?).
The Generator must therefore produce images that are both realistic and perceptually consistent with the requested attribute values. By exploring this 2D conditioning space, we can observe how the model visualizes the interactions between these social cues.
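The dual-conditional setup described above can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation at a small 32x32 resolution; the layer sizes, module names, and the choice to concatenate the two-dimensional condition vector onto the latent code are all illustrative assumptions, not our final architecture.

```python
import torch
import torch.nn as nn


class ConditionalGenerator(nn.Module):
    """Maps (latent z, [wealth, liveliness]) to an image."""

    def __init__(self, z_dim=100, cond_dim=2, img_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            # Latent code and condition vector enter together.
            nn.Linear(z_dim + cond_dim, 128 * 8 * 8),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 8x8 -> 16x16
            nn.ReLU(True),
            nn.ConvTranspose2d(64, img_ch, 4, stride=2, padding=1),  # 16x16 -> 32x32
            nn.Tanh(),
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1))


class ConditionalDiscriminator(nn.Module):
    """Two heads: a realism score and predicted [wealth, liveliness]."""

    def __init__(self, cond_dim=2, img_ch=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(img_ch, 64, 4, stride=2, padding=1),   # 32x32 -> 16x16
            nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),      # 16x16 -> 8x8
            nn.LeakyReLU(0.2, True),
            nn.Flatten(),
        )
        self.realism = nn.Linear(128 * 8 * 8, 1)         # real vs. generated
        self.attrs = nn.Linear(128 * 8 * 8, cond_dim)    # wealth, liveliness

    def forward(self, img):
        h = self.features(img)
        return self.realism(h), self.attrs(h)
```

The two discriminator heads share one feature extractor, so the realism and attribute-accuracy signals both shape the same learned representation.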
A core part of the project focuses on analysis:
Qualitative Analysis: We visualize the GAN's output as we vary wealth (vertical axis) and liveliness (horizontal axis). This allows us to interpret which visual cues the model uses (greenery, cars, density, building height, etc.) and how attributes combine or conflict.
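The 2D sweep can be set up as a simple grid of condition vectors, one per tile in the output montage. A minimal sketch; the 5x5 step counts and the [0, 1] attribute range are illustrative assumptions.

```python
def condition_grid(n_wealth=5, n_lively=5):
    """Build (wealth, liveliness) pairs spanning [0, 1] x [0, 1].

    Rows vary wealth (the vertical direction) and columns vary
    liveliness (the horizontal direction), matching the layout
    used when tiling the generated images for inspection.
    """
    grid = []
    for i in range(n_wealth):
        row = []
        for j in range(n_lively):
            wealth = i / (n_wealth - 1)
            lively = j / (n_lively - 1)
            row.append((wealth, lively))
        grid.append(row)
    return grid
```

Feeding each pair in this grid to the generator with a fixed latent code isolates the effect of the conditions from latent-space randomness.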
Quantitative Analysis: We analyze training dynamics, track adversarial imbalance, and compare models trained on different attribute configurations (baseline, wealth-only, multi-attribute, and wealth + lively). We also evaluate whether conditioning stabilizes or destabilizes training.
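One way to track adversarial imbalance is a running ratio of discriminator to generator loss. The sketch below uses a plain moving average; the window size is an illustrative choice, and what counts as "imbalanced" would be calibrated empirically.

```python
from collections import deque


class ImbalanceTracker:
    """Track the running ratio of discriminator to generator loss.

    A ratio drifting far from ~1 flags adversarial imbalance:
    the discriminator overpowering the generator, or vice versa.
    """

    def __init__(self, window=100):
        self.d_losses = deque(maxlen=window)
        self.g_losses = deque(maxlen=window)

    def update(self, d_loss, g_loss):
        self.d_losses.append(d_loss)
        self.g_losses.append(g_loss)

    def ratio(self):
        if not self.d_losses:
            return None
        d = sum(self.d_losses) / len(self.d_losses)
        g = sum(self.g_losses) / len(self.g_losses)
        return d / g
```

Logging this ratio per training step gives a single curve to compare across the baseline, wealth-only, and multi-attribute runs.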
This project directly engages with course concepts (CNNs, GANs, conditional architectures, loss design) and provides rich grounding for an ethics discussion around biased perception, urban aesthetics, and fairness in machine learning.
What are some key limitations you anticipate facing when working on this project?
We anticipate several challenges as we scale Urbanize. Training a dual-conditional GAN is inherently unstable, and adding multiple perceptual attributes may increase the risk of mode collapse, adversarial imbalance, or low-fidelity outputs. Ensuring stable convergence, especially while generating fine-grained urban detail, will likely be one of the technical bottlenecks of the project.
A second limitation arises from the dataset itself. Place Pulse 2.0 contains subjective, culturally dependent ratings of “wealth” and “liveliness,” which may encode human biases. Since the model learns directly from these perceptions, we expect it to reproduce or even amplify these biases in its generated scenes. This raises questions about fairness and interpretability that will be central to our analysis.
Finally, evaluating success will be challenging because generative quality and perceptual accuracy are difficult to quantify. We anticipate relying heavily on qualitative latent-space explorations and auxiliary prediction models rather than objective metrics. As a result, interpreting the model’s behavior and distinguishing meaningful patterns from correlations will demand careful, visually grounded analysis.
To mitigate these risks, we have a clear fallback plan: if the full dual-conditional GAN proves too unstable, we will decouple the problem. First, we will train a high-quality attribute-prediction model. Then, we will freeze it and use its outputs as a fixed guidance signal to steer a standard GAN. This two-stage pipeline offers a more stable pathway to meaningful results.
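In the fallback pipeline, the frozen predictor enters the generator's objective as a guidance term. A hedged sketch assuming PyTorch: the function name, the MSE guidance form, and the weight `lam` are illustrative assumptions about how we would combine the two signals.

```python
import torch
import torch.nn as nn


def generator_loss(d_logits, predicted_attrs, target_attrs, lam=1.0):
    """Fallback objective: adversarial loss plus frozen-predictor guidance.

    d_logits: discriminator logits on generated images
    predicted_attrs: frozen predictor's [wealth, liveliness] estimates
                     on those generated images
    target_attrs: the conditions the generator was asked to satisfy
    lam: weight on the guidance term (an illustrative hyperparameter)
    """
    # Standard non-saturating GAN term: try to make fakes look real.
    adversarial = nn.functional.binary_cross_entropy_with_logits(
        d_logits, torch.ones_like(d_logits))
    # Guidance term: match the frozen predictor's estimates to the targets.
    guidance = nn.functional.mse_loss(predicted_attrs, target_attrs)
    return adversarial + lam * guidance
```

Because the predictor is frozen, only the generator receives gradients from the guidance term, which sidesteps the two-player instability of the full dual-conditional setup.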
Project Data Ideas:
Place Pulse 2.0 (PP2)
Our primary dataset. Contains 110,688 Google Street View images with human-rated perception scores for wealth, liveliness, safety, and other attributes.
This dataset gives us meaningful socioeconomic signals and is directly aligned with our research question.
Dataset link (from paper): https://arxiv.org/abs/1608.01769