by Josef Sivic and Alexei A. Efros

Urban-scale quantitative visual analysis opens up new ways for Smart Cities to be visualized, modelled, planned and simulated, taking into account large-scale dynamic visual inputs from a range of visual sensors.

Map-based street-level imagery, such as Google Maps with Street View, provides a comprehensive visual record of many cities around the world. For example, the visual appearance of Paris has been captured in almost 100,000 publicly available Street View images. We estimate there are approximately 60 million Street View images for France alone, covering all the major cities. In the near future, additional visual sensors are likely to become more widespread; for example, cameras are being built into most newly manufactured cars. Another growing trend is the ability of individuals to continuously capture their daily visual experiences using wearable mobile devices such as Google Glass. Collectively, all this data provides large-scale, comprehensive and dynamically updated visual records of urban environments.

Automatic analysis of urban visual data
The next opportunity lies in developing automatic tools that allow for the large-scale quantitative analysis of the available urban visual records. Imagine a scenario where we could provide quantitative answers to questions like:

  • What are the typical architectural elements that characterize the visual style of a city (e.g., window or balcony type)? (Figure 1a)
  • What is their geo-spatial distribution? (Figure 1b)
  • How does the visual style of an area evolve over time?
  • What are the boundaries between visually coherent areas in a city?

Figure 1: Quantitative visual analysis of urban environments from Street View imagery [1]. 1a: Examples of architectural visual elements characteristic of Paris, Prague and London, identified through the analysis of thousands of Street View images. 1b: An example of a geographic pattern (shown as red dots on the map of Paris) of one visual element, balconies with cast-iron railings, showing their concentration along the main boulevards. This type of automatic quantitative visual analysis has a potentially significant role in urban planning applications.
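
The kind of geographic pattern described for Figure 1b can be approximated by counting, per map cell, how often a given visual element is detected. The following is only a minimal sketch of that idea, not the procedure used in [1]; the function name `detections_per_cell`, the `cell_deg` parameter and the sample coordinates are assumptions made for illustration.

```python
from collections import Counter
import math


def detections_per_cell(locations, cell_deg=0.01):
    """Count detections of one visual element per latitude/longitude grid cell.

    `locations` is an iterable of (latitude, longitude) pairs in degrees.
    `cell_deg` is the (hypothetical) grid resolution in degrees.
    """
    counts = Counter()
    for lat, lon in locations:
        # Map each detection to the integer index of the cell containing it.
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts


# Hypothetical detections: two close together, one further away.
sample = [(48.8566, 2.3522), (48.8567, 2.3523), (48.905, 2.405)]
densest_cell, n_detections = detections_per_cell(sample).most_common(1)[0]
```

Cells with high counts correspond to areas where the element is concentrated; plotting them over a map would give a coarse version of the red-dot pattern described in the caption.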

These examples just touch on the range of interesting questions that can be posed regarding the visual style of a city. Other types of questions concern the distribution of people and their activities. For example, how do the number of people and their activities at a particular place evolve over the course of a day, across the seasons or over the years? Or perhaps you might want to know the make-up of activities on a given street: the presence of tourists sightseeing, locals shopping, the elderly walking their dogs, or children playing. This type of data can also be used to address significant urban issues: for example, what are the major causes of bicycle accidents?

New applications
Progress in answering these types of questions would open up new ways for Smart Cities to be visualized, modelled, planned and simulated, taking into account large-scale dynamic visual inputs from a range of visual sensors. Some examples of how this data might be applied include:

  • the real-time quantitative mapping and visualization of existing urban spaces [1] to support architects and decision makers (Figure 1),
  • the ability to predict and model the evolution of cities [3] (e.g., land-use policies and the way they impact on the visual appearances of different neighbourhoods),
  • obtaining detailed dynamic semantic city-scale 3D reconstructions and using them to simulate different environmental scenarios, e.g., levels of noise, energy consumption or illumination, and
  • the analysis of human activities, e.g., evaluating the future success of a restaurant or the need to introduce new traffic safety measures.

The challenge of urban-scale visual analysis
Impressive demonstrations of the analysis of large-scale data sets have already started to appear in other scientific disciplines. In natural language processing, an analysis of more than 5 million books published between 1800 and 2000 revealed interesting linguistic, sociological and cultural patterns [2]. In the visual domain, however, a similar large-scale analysis has yet to be demonstrated. As visual data and computational resources are becoming more widely available, the key scientific challenge now lies in developing powerful models which can competently meet the spatio-temporal, widely distributed and dynamic characteristics of this visual data. For example, while the vocabulary and grammar of written text are well defined, there is no accepted visual vocabulary and grammar that captures the subtle but important visual differences in architectural styles, or the different visual appearances of human activities on city streets.

Example: quantitative analysis of architectural style
In this first phase of investigation, we considered the quantitative visual analysis of architectural style [1]. Using a large repository of geo-tagged imagery, we sought to automatically identify which visual elements (e.g., windows, balconies and street signs) define a given geo-spatial area (in this case Paris). This is a tremendously difficult task, as the differences between the distinguishing features of different places can be very subtle. We were also faced with a difficult search problem: given all the possible patches in all the possible images, which patches are both geographically informative and frequently occurring? To address these issues, we proposed a discriminative clustering approach that takes the weak geographic supervision into account. We showed that geographically representative image elements can be discovered automatically from Google Street View imagery in a discriminative manner. We applied the algorithm to image datasets from 12 cities (Paris, London, Prague, Barcelona, Milan, New York, Boston, Philadelphia, San Francisco, São Paulo, Mexico City and Tokyo), with each dataset featuring approximately 10,000 images. An example of the results is illustrated in Figure 1. This example demonstrates that the learnt elements are visually interpretable and perceptually geo-informative. We further demonstrated that the identification of these elements can support a variety of urban-scale quantitative visual analysis tasks, such as mapping architectural correspondences and influences within and across cities, or finding representative elements at different geo-spatial scales [1].
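
To make the search criterion concrete, the sketch below ranks candidate patches from a target city by how many of their nearest neighbours in descriptor space also come from that city. A patch whose neighbours are mostly from the same city is both frequent and geographically informative. This is a simplified nearest-neighbour illustration of the idea, not the full discriminative clustering method of [1]; the function name `rank_patches_by_geo_purity`, the parameter `k` and the toy descriptors are assumptions made for this example.

```python
import numpy as np


def rank_patches_by_geo_purity(features, is_target, k=5):
    """Rank target-city patches by the geographic purity of their neighbours.

    features  : (n, d) array of patch descriptors (assumed precomputed).
    is_target : length-n boolean array, True if the patch comes from the
                target city (the weak geographic label).
    Returns (order, purity): indices of target patches sorted from most to
    least geo-informative, and the per-patch purity scores.
    """
    features = np.asarray(features, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)

    # Pairwise squared Euclidean distances between all patch descriptors.
    sq_norms = (features ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dists, np.inf)  # a patch is not its own neighbour

    # Fraction of each patch's k nearest neighbours that are from the target city.
    neighbours = np.argsort(dists, axis=1)[:, :k]
    purity = is_target[neighbours].mean(axis=1)

    # Only patches from the target city are candidates; the best come first.
    target_idx = np.flatnonzero(is_target)
    order = target_idx[np.argsort(-purity[target_idx], kind="stable")]
    return order, purity


# Toy data: a cluster seen only in the target city, and a cluster seen everywhere.
distinctive = [[10 + 0.01 * i, 10.0] for i in range(6)]
generic = [[0.01 * i, 0.0] for i in range(12)]
labels = [True] * 6 + [i % 2 == 0 for i in range(12)]
order, purity = rank_patches_by_geo_purity(distinctive + generic, labels, k=4)
```

In the toy data, patches from the cluster that appears only in the target city receive higher purity than patches from the cluster shared with other cities, so they are ranked first. A practical system would then refine the top-ranked candidates, for example by training a detector per cluster.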

CityLab@Inria Project Lab on Smart Cities:

[1] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? ACM Transactions on Graphics (SIGGRAPH 2012), 31(4), 2012.
[2] J.B. Michel et al. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.
[3] C.A. Vanegas et al. Modelling the appearance and behaviour of urban spaces. Computer Graphics Forum, 29(1):25–42, 2010.

Please contact:
Josef Sivic
Inria and Ecole Normale Supérieure, France

Alexei A. Efros
UC Berkeley, USA