Open Standards for the Immersive Web

by Dave Raggett (ERCIM)

The Web grew exponentially on the back of open standards, rapidly eclipsing proprietary alternatives. The same principle will apply to the Immersive Web. This article introduces W3C’s work on related standards and how they fit into the standardisation landscape; key challenges for privacy, accessibility, and scaling; and directions for future work, including Generative AI.

As an old-timer, I helped to coordinate early work on the World Wide Web, including the initial standards for HTTP and HTML. In mid-1994, I proposed a vision for extending the Web to support immersive virtual reality [L1]. Work on VRML sadly didn’t quite live up to this vision, but years later computers and networks are hugely better, so the future looks very promising. The Web grew exponentially on the back of widely supported open standards (HTTP, HTML, CSS, JavaScript, etc.), rapidly eclipsing proprietary alternatives. The same principle will now apply to the Immersive Web.

Expected application classes include shopping, entertainment, education, industry, online meetings and desktop replacement. Smart phones and tablets can be used to show, as an augmented reality experience, how your new kitchen, or new furniture and decor for your living room, would look. You can see how you look in a smart mirror (e.g. your phone) when wearing some new clothes, new shoes, different hair styles, jewellery and glasses, before purchasing them online. For entertainment, you will be able to play VR and AR games for mutual enjoyment with your friends. You will have access to educational experiences that set you tasks to solve in an immersive VR or AR environment. Industry applications will use smart glasses for machine maintenance, smart warehouses, etc. Online meetings that project you and your companions into a shared immersive VR or AR environment will be a vast improvement on today's video meetings. VR and AR will function as a computer desktop replacement, offering unlimited virtual screens via headsets or smart glasses.

Simple 3D models can be used to interpret video from 2D cameras to create live avatars of people. This was demonstrated in the mid-nineties as a means to reduce the bandwidth for video calls. Today's computers are much more powerful, and can generate detailed 3D animations of people's heads in real-time, including facial expressions, derived using video from the camera built into the laptop or smart phone. People could appear as themselves or as avatars of their own choosing.

Devices include: VR headsets; smart glasses for AR; large 2D displays; glasses-free 3D displays, e.g. microlens array-based monitors that beam a different image into each of your eyes; cameras for imaging you and your environment, including specialised devices such as stereo and depth-sensing cameras, as well as 360-degree cameras that see in all directions at once; microphones and speakers for audio (including surround sound and spatial audio); orientation, motion and acceleration sensors; games controllers; cameras for gaze tracking and hand gestures; and haptic devices, e.g. smart gloves.

W3C is addressing this need through the Immersive Web Working Group [L2], whose participating organisations include Adobe, Apple, Google, Meta, Microsoft, Samsung and many others. The group is developing WebXR, a suite of standards for use by browsers and other software. It provides a variety of modular APIs for access to gamepads, augmented reality overlays, hit testing, layers, hand gestures, depth sensing, anchors, lighting estimation and so forth. Other W3C Working Groups have developed complementary standards, including WebGPU for access to GPU hardware in conjunction with the shading language WGSL, as well as Web Audio, Web Neural Network (WebNN), WebAssembly, WebRTC, WebSockets, RDF/Linked Data, the Web of Things, and many more.
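
To give a feel for how the modular WebXR APIs are used, here is a minimal TypeScript sketch of an application asking for an AR session. The helper names buildArSessionOptions and startArSession are hypothetical, and the split into essential and nice-to-have features is an assumption about how an application might want to degrade gracefully. The feature strings and the calls on navigator.xr are those defined by the WebXR specifications.

```typescript
// Feature descriptor strings defined by WebXR and its modules.
type FeatureName =
  | "local-floor"
  | "hit-test"
  | "anchors"
  | "hand-tracking"
  | "light-estimation";

interface ArSessionOptions {
  requiredFeatures: FeatureName[];
  optionalFeatures: FeatureName[];
}

// Hypothetical helper: split the features an application wants into those it
// cannot run without and those it can live without.
function buildArSessionOptions(
  wanted: FeatureName[],
  essential: FeatureName[],
): ArSessionOptions {
  const must = new Set(essential);
  return {
    requiredFeatures: wanted.filter((f) => must.has(f)),
    optionalFeatures: wanted.filter((f) => !must.has(f)),
  };
}

// Hypothetical helper: resolves to null when WebXR or AR is unavailable.
// In a browser, requestSession must be called from a user gesture such as a click.
async function startArSession(options: ArSessionOptions): Promise<unknown> {
  const xr = (globalThis as any).navigator?.xr;
  if (!xr || !(await xr.isSessionSupported("immersive-ar"))) {
    return null;
  }
  return xr.requestSession("immersive-ar", options);
}

const arOptions = buildArSessionOptions(
  ["local-floor", "hit-test", "hand-tracking"],
  ["local-floor"],
);
```

Listing a feature as optional lets the same page run on devices that lack it, which is one practical benefit of the modular design.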

W3C is one of many standards development organisations with an interest in extended reality. The IEEE VR and AR Working Group developed the P2048 suite of standards, including device and video taxonomies, personal identity, environment safety, immersive user interfaces, mapping virtual objects into the real world, and associated interoperability [L3]. The Khronos Group is responsible for glTF, a JSON-based format for 3D assets, and WebGL, a JavaScript API for rendering interactive 2D and 3D graphics in Web browsers. glTF 2.0 is now available as ISO/IEC 12113:2022 and work continues on extensions, e.g. for lighting. WebGPU is likely to slowly take over from WebGL over a period of years. X3D is a suite of ISO/IEC standards developed by the Web3D Consortium, covering graphics formats and APIs. Universal Scene Description (USD) is an open-source framework for 3D originally developed by Pixar, with support from Adobe, Apple, Autodesk and NVIDIA. Opportunities for open-source developers include work on libraries for software and 3D assets, as well as content creation, e.g. the Babylon.js and Godot 2D/3D engines. Generative AI can be expected to play a major role in the Immersive Web, with the ability to synthesise rich virtual environments from simple prompts and populate them using information from servers and peer-to-peer connections.
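
Because glTF is JSON-based, ordinary Web tooling can inspect an asset before anything is rendered. The TypeScript sketch below walks the node tree of the default scene and lists the meshes it references. The interface covers only a small subset of the glTF 2.0 structure, the helper meshesInDefaultScene is hypothetical, and falling back to scene 0 when no default scene is declared is an assumption made for brevity.

```typescript
// Small subset of the glTF 2.0 JSON structure, enough for this sketch.
interface GltfDocument {
  asset: { version: string };
  scene?: number;
  scenes?: { nodes?: number[] }[];
  nodes?: { name?: string; mesh?: number; children?: number[] }[];
  meshes?: { name?: string; primitives: unknown[] }[];
}

// Hypothetical helper: names of the meshes reachable from the default scene.
function meshesInDefaultScene(doc: GltfDocument): string[] {
  if (doc.asset.version !== "2.0") {
    throw new Error(`unsupported glTF version: ${doc.asset.version}`);
  }
  const names: string[] = [];
  const visit = (index: number): void => {
    const node = doc.nodes?.[index];
    if (!node) return;
    if (node.mesh !== undefined) {
      names.push(doc.meshes?.[node.mesh]?.name ?? `mesh ${node.mesh}`);
    }
    (node.children ?? []).forEach((child) => visit(child));
  };
  const scene = doc.scenes?.[doc.scene ?? 0];
  (scene?.nodes ?? []).forEach((root) => visit(root));
  return names;
}

// Abbreviated example: geometry (primitives, accessors, buffers) is omitted,
// so this is not a complete, valid asset.
const sampleGltf: GltfDocument = {
  asset: { version: "2.0" },
  scene: 0,
  scenes: [{ nodes: [0] }],
  nodes: [{ name: "room", children: [1, 2] }, { mesh: 0 }, { mesh: 1 }],
  meshes: [{ name: "table", primitives: [] }, { primitives: [] }],
};

console.log(meshesInDefaultScene(sampleGltf)); // [ 'table', 'mesh 1' ]
```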

The Immersive Web faces many challenges: how to preserve the user’s privacy while offering an optimal user experience, how to provide accessibility for users with physical disabilities, how to decouple applications from the details of devices, how to scale to richer 3D environments, and issues around user fatigue and motion sickness. Hand gestures are a great way to control applications, but risk fatigue over long periods. Noticeable lags in rendering can likewise cause fatigue and motion sickness. With respect to scaling, large 3D assets can take significant time to download, shattering the illusion of an immersive 3D world. This can be mitigated using cloud-based rendering or with level of detail control plus pre-loading libraries of assets, along with an option to substitute and adapt assets based upon their similarity.
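
The combination of level of detail control and pre-loading can be illustrated with a short TypeScript sketch. It is one simple policy among many: each asset comes in several variants, the variant is chosen from the viewer's distance, and nearer assets are fetched first until a download budget is used up. All names, the distance thresholds and the greedy budget rule are assumptions made for illustration, not part of any standard.

```typescript
interface LodLevel {
  url: string;         // hypothetical asset location
  maxDistance: number; // use this variant up to this viewer distance (metres)
  bytes: number;       // download size of this variant
}

interface PlacedAsset {
  id: string;
  distance: number;    // current distance from the viewer (metres)
  levels: LodLevel[];  // ordered from most to least detailed
}

// Pick the most detailed variant whose distance limit covers the viewer,
// falling back to the coarsest variant for anything further away.
function pickLod(levels: LodLevel[], distance: number): LodLevel {
  const coarsest = levels[levels.length - 1];
  if (coarsest === undefined) {
    throw new Error("at least one level of detail is required");
  }
  return levels.find((level) => distance <= level.maxDistance) ?? coarsest;
}

// Greedy pre-load plan: nearest assets first, skipping any variant that
// would exceed the remaining byte budget.
function planPreload(assets: PlacedAsset[], budgetBytes: number): string[] {
  const urls: string[] = [];
  let remaining = budgetBytes;
  const nearestFirst = [...assets].sort((a, b) => a.distance - b.distance);
  for (const asset of nearestFirst) {
    const level = pickLod(asset.levels, asset.distance);
    if (level.bytes > remaining) continue;
    remaining -= level.bytes;
    urls.push(level.url);
  }
  return urls;
}
```

A fuller version would also re-plan as the user moves, and could substitute a similar asset that is already cached instead of skipping one that does not fit the budget.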

To simplify applications and to support accessibility, we need higher-level intent-based models and declarative formats. Application frameworks can help, but without standards, they risk fragmentation into incompatible solutions. Browser support for higher-level models would enable accessibility without the need for applications to know about the user’s accessibility preferences.
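
As a rough illustration of what an intent-based model could look like, the TypeScript sketch below maps device-specific input onto a small set of abstract intents, so that application logic never needs to know which device or accessibility aid produced them. The intent vocabulary, the input shapes and the particular bindings are all hypothetical; no existing standard defines them.

```typescript
// Hypothetical intent vocabulary that an application would handle.
type Intent = "select" | "back" | "next" | "previous";

// Hypothetical device-level inputs, one shape per kind of device.
type DeviceInput =
  | { device: "gamepad"; button: number }
  | { device: "hand"; gesture: "pinch" | "swipe-left" | "swipe-right" }
  | { device: "keyboard"; key: string }
  | { device: "voice"; phrase: string };

const keyBindings = new Map<string, Intent>([
  ["Enter", "select"],
  ["Escape", "back"],
  ["ArrowRight", "next"],
  ["ArrowLeft", "previous"],
]);

const voiceBindings = new Map<string, Intent>([
  ["select", "select"],
  ["go back", "back"],
  ["next", "next"],
  ["previous", "previous"],
]);

// Translate a device input into an intent, or null if it has no binding.
function toIntent(input: DeviceInput): Intent | null {
  switch (input.device) {
    case "gamepad":
      if (input.button === 0) return "select";
      if (input.button === 1) return "back";
      return null;
    case "hand":
      if (input.gesture === "pinch") return "select";
      return input.gesture === "swipe-left" ? "next" : "previous";
    case "keyboard":
      return keyBindings.get(input.key) ?? null;
    case "voice":
      return voiceBindings.get(input.phrase.trim().toLowerCase()) ?? null;
  }
}
```

With such a layer in the browser rather than in each application, a user who cannot make hand gestures could trigger the same "select" intent by voice or keyboard without the application being aware of the difference.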

Websites are independent, and discoverable via search engines and hypertext links. The Immersive Web is likely to be similar, with many independent applications. An open question is what would attract businesses to set up in shared virtual worlds as envisaged in science fiction novels and movies, for instance Neal Stephenson's "Snow Crash" (1992) and William Gibson's "Neuromancer" (1984). Virtual real estate is valuable only if there is a commercial advantage to being found in the same vicinity as others. Unlike in the real world, however, users can effortlessly jump from one virtual location to another, making adjacency less important. This emphasises the importance of the services the virtual world can offer to its inhabitants, and the same applies to real-world high streets threatened by the rise of online shopping.

I would like to acknowledge guidance in preparing this article from my ERCIM colleague François Daoust.

Links:
[L1] https://www.w3.org/People/Raggett/vrml/vrml.html
[L2] https://www.w3.org/immersive-web/
[L3] https://digitalreality.ieee.org/standards

Please contact:
Dave Raggett, ERCIM, France