Urbanize: Controllable Urban Scene Synthesis

Gordan Milovac, Arib Syed, Jeffrey Mu, Marcus Winter

Tags: Computer Vision, GANs, Deep Learning, Urban Perception, Python

This project was completed as a final project for CSCI1470: Deep Learning in Spring 2025 at Brown University. Urbanize explores how machine learning models internalize and reproduce human perceptions of urban environments, such as perceived wealth and liveliness, by training conditional generative models on large-scale crowdsourced data.


[Figure: Urban environments]


Background

Human perception of urban spaces—how safe, wealthy, lively, or beautiful a place appears—plays a critical role in city planning, social research, and public policy. Prior work such as Deep Learning the City (Dubey et al., 2016) demonstrated that these perceptions can be quantified at scale using pairwise comparisons of Google Street View images.

However, most existing work focuses on predicting perceptual attributes from images. Far less attention has been given to the inverse problem:
Can generative models learn and manipulate the visual cues that shape human perception of urban environments?

Urbanize addresses this gap by generating urban scenes conditioned on perceptual attributes, providing insight into how models encode socioeconomic and activity-based visual signals.

GitHub Repository:
https://github.com/jeffreymu1/urbanize

Final Poster (PDF):
https://drive.google.com/file/d/1nJrO0u9iBa9XqdJKSajn-ckW4DDPxzC0/view?usp=sharing


[Figure: Algorithm overview]


Project Goal

The primary goal of Urbanize was to design a controllable generative framework capable of synthesizing urban scenes that differ systematically in perceived wealth and liveliness, as judged by humans.

Specifically, we aimed to:

  1. Train conditional GANs on crowdsourced perception labels derived from street-level imagery.
  2. Condition generation on perceived wealth and liveliness so these attributes can be controlled at synthesis time.
  3. Analyze how conditioning changes the visual content of generated scenes, and what that reveals about the cues the models learn.


Methodology

Dataset and Preprocessing

We used the Place Pulse 2.0 (PP2) dataset, consisting of 110,688 Google Street View images annotated via pairwise comparisons across perceptual attributes.

Preprocessing converted the raw pairwise comparisons into continuous per-image perception scores suitable for conditioning, and prepared the images for GAN training; a minimal sketch of the score conversion follows below.

The attributes available in PP2 cover six perceptual dimensions (safe, lively, beautiful, wealthy, boring, and depressing); our models conditioned primarily on perceived wealth and liveliness.
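The actual pipeline lives in the repository; as a rough illustration, the sketch below turns pairwise votes into simple per-image win-rate scores (the PP2 literature more often uses TrueSkill or Q-scores). The file name and the columns left_id, right_id, winner, and attribute are hypothetical placeholders for whatever the downloaded vote file actually uses.

```python
import pandas as pd

# Hypothetical schema: one row per pairwise vote for a single attribute.
# Adjust the column names to match the PP2 release you download.
votes = pd.read_csv("votes.csv")  # columns: left_id, right_id, winner, attribute

def win_rate_scores(votes: pd.DataFrame, attribute: str) -> pd.Series:
    """Turn pairwise votes for one attribute into a per-image score in [0, 1]."""
    df = votes[votes["attribute"] == attribute]

    # How often each image appeared, and how often it won (ties ignored here).
    appearances = pd.concat([df["left_id"], df["right_id"]]).value_counts()
    wins = pd.concat([
        df.loc[df["winner"] == "left", "left_id"],
        df.loc[df["winner"] == "right", "right_id"],
    ]).value_counts()

    return (wins.reindex(appearances.index, fill_value=0) / appearances).rename(attribute)

wealth_scores = win_rate_scores(votes, "wealthy")  # conditioning signal for the GAN
lively_scores = win_rate_scores(votes, "lively")
```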


Model Architecture

We implemented and trained four GAN variants:

  1. Baseline GAN
    Trained solely on raw images, without access to perception labels.

  2. Wealth-Conditioned GAN
    Conditioned on continuous “wealth” perception scores.

  3. Multi-Attribute GAN
    Conditioned on all available perceptual attributes.

  4. Wealth + Lively GAN
    Conditioned jointly on wealth and liveliness to study attribute interaction.

All models followed a standard adversarial training framework, with conditioning introduced through auxiliary inputs to the generator and discriminator.
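The exact architectures are in the repository; the sketch below shows one common way to implement this style of conditioning in PyTorch, concatenating attribute scores to the generator's noise vector and broadcasting them as extra input channels for the discriminator. The layer sizes, the 32x32 output resolution, and the class names CondGenerator and CondDiscriminator are illustrative assumptions, not the project's actual configuration.

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Maps (noise, attribute scores) to an image; scores are appended to z."""
    def __init__(self, z_dim=100, n_attrs=1, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + n_attrs, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),  # 1 -> 4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),              # 4 -> 8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),                # 8 -> 16
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),                               # 16 -> 32
        )

    def forward(self, z, attrs):
        x = torch.cat([z, attrs], dim=1)        # condition by concatenation
        return self.net(x[..., None, None])     # reshape to (B, C, 1, 1) before upsampling

class CondDiscriminator(nn.Module):
    """Scores real/fake; attribute scores are broadcast and stacked as extra channels."""
    def __init__(self, n_attrs=1, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + n_attrs, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),                # 32 -> 16
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),              # 16 -> 8
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),             # 8 -> 4
            nn.Conv2d(256, 1, 4, 1, 0),                                                             # 4 -> 1 logit
        )

    def forward(self, img, attrs):
        # Broadcast each attribute score to a full-resolution plane, then concatenate.
        planes = attrs[:, :, None, None].expand(-1, -1, img.size(2), img.size(3))
        return self.net(torch.cat([img, planes], dim=1)).view(-1)
```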


[Figure: GAN outputs]


Training and Evaluation

We monitored the generator and discriminator losses and periodically inspected generated samples throughout training.

While quantitative metrics (e.g., FID) were tracked, qualitative visual analysis was the primary evaluation tool, given the subjective nature of perceptual attributes. A sketch of how FID tracking can be wired up is shown below.
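This snippet is an example of how FID can be computed, not necessarily how the project computed it; it assumes torchmetrics is installed and that real_batches and fake_batches are iterables of float image tensors in [0, 1] with shape (B, 3, H, W).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# normalize=True lets the metric accept float images in [0, 1].
fid = FrechetInceptionDistance(feature=2048, normalize=True)

for real in real_batches:
    fid.update(real, real=True)    # accumulate statistics of real street-view images
for fake in fake_batches:
    fid.update(fake, real=False)   # accumulate statistics of generated images

print("FID:", fid.compute().item())
```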


Results and Analysis

Baseline Model

Wealth-Conditioned Model

Wealth + Lively Model

Conflicting attribute combinations resulted in degraded or unstable generations, highlighting limitations in disentangling perceptual dimensions.
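One way to probe this kind of attribute interaction is to fix the noise vector and sweep both conditioning values over a grid, then compare the resulting images side by side. The sketch below assumes the hypothetical CondGenerator from the architecture section, trained with two attributes; it is illustrative rather than the project's actual analysis code, and weight loading is omitted.

```python
import torch
from torchvision.utils import save_image

gen = CondGenerator(z_dim=100, n_attrs=2).eval()       # attrs = (wealth, lively)

levels = torch.linspace(0.0, 1.0, steps=5)              # attribute scores to sweep
z = torch.randn(1, 100).repeat(len(levels) ** 2, 1)     # same noise everywhere isolates the conditioning

# Cartesian product of (wealth, lively) levels: one generated image per combination.
wealth, lively = torch.meshgrid(levels, levels, indexing="ij")
attrs = torch.stack([wealth.flatten(), lively.flatten()], dim=1)

with torch.no_grad():
    imgs = gen(z, attrs)                                 # (25, 3, 32, 32), values in [-1, 1]

save_image(imgs, "wealth_lively_grid.png", nrow=len(levels), normalize=True)
```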


[Figure: GAN outputs 2]


What Worked and What Didn’t

Successes:

  1. Conditioning on crowdsourced perception scores produced generations that varied with the target attribute, indicating that the models internalized perceptual cues.
  2. The framework showed that perception labels can steer image synthesis along socioeconomic and activity-based dimensions.

Limitations:

  1. Conflicting attribute combinations led to degraded or unstable generations, and the perceptual dimensions proved difficult to fully disentangle.
  2. Evaluation relied mainly on qualitative inspection, since the perceptual attributes are inherently subjective.


Conclusion

Urbanize shows that conditional generative models can do more than produce realistic images—they can encode and amplify the visual cues humans associate with socioeconomic and activity-based perceptions.

By generating urban scenes along perceptual dimensions such as wealth and liveliness, this project provides insight into how machine learning models reflect—and potentially reinforce—human visual bias. These findings raise important questions about fairness, interpretability, and downstream applications in urban analytics.


Future Work


Author & Contributions

Arib Syed, Jeffrey Mu, Marcus Winter: Dataset exploration, GAN training, conditioning strategy, architecture experimentation, analysis, and final poster preparation.

Gordan Milovac: Project direction pivot, model design, training analysis, visual interpretation, and final poster preparation.

© Gordan Milovac