Report on Methods and Applications for Crafting 3D Humans (2024)

Lei Liu, Ke Zhao, Bournemouth University

Abstract

This paper presents an in-depth exploration of 3D human model and avatar generation technology, propelled by the rapid advancements in large-scale models and artificial intelligence. We review the comprehensive process of 3D human model generation, from scanning to rendering, and highlight the pivotal role these models play in entertainment, VR, AR, healthcare, and education. We underscore the significance of diffusion models in generating high-fidelity images and videos and emphasize the indispensable nature of 3D human models in enhancing user experiences and functionality across various fields. Furthermore, we anticipate the potential of integrating large-scale models with deep learning to revolutionize 3D content generation, offering insights into the future prospects of the technology. We conclude by emphasizing the importance of continuous innovation in the field, suggesting that ongoing advancements will significantly expand the capabilities and applications of 3D human models and avatars.

Index Terms:

3D human, avatar, AIGC

I Introduction

The rapid advancement of large-scale models[1] and artificial intelligence has significantly transformed the landscape of digital content creation. Among the various innovations, 3D content generation technology stands out due to its extensive applications across numerous domains. This paper focuses on the generation of 3D human models and avatars, exploring their technical foundations, applications, challenges, and future prospects. Diffusion models have shown strong capabilities in high-fidelity image and video generation[2, 3, 4, 5, 6, 7, 8]. Applying pre-trained diffusion models to 3D content generation can significantly reduce the demand for computational resources and the reliance on 3D datasets. 3D generation[9, 10, 11, 12, 13, 14, 15, 16, 17, 18] involves creating three-dimensional digital representations of objects, beings, or environments through sophisticated algorithms and tools. This technology has gained prominence in entertainment, virtual reality (VR), augmented reality (AR), healthcare, and education, offering new dimensions of interactivity and realism. Specifically, 3D human models and avatars play pivotal roles in enhancing user experience and functionality in these fields.

The process of generating 3D human models[19, 20, 21, 12] typically includes scanning, modeling, and rendering. Scanning technologies capture the geometric and textural details of a human subject using 3D scanners or multi-view camera systems. The captured data is then processed through modeling techniques such as mesh modeling and voxel modeling to create a high-fidelity 3D representation. Finally, rendering techniques, including ray tracing and global illumination, convert these models into visually realistic images. These technologies collectively enable the creation of lifelike digital humans that can be animated and manipulated in various digital environments.

In entertainment and media, 3D human models are indispensable. Films and video games leverage these models to bring characters to life, enhancing storytelling and visual impact. For instance, blockbuster movies like “Avatar” and “The Avengers” extensively use 3D modeling to create immersive visual effects and realistic characters. Similarly, in video games, these models contribute to creating engaging and interactive experiences, allowing players to immerse themselves in virtual worlds.

Virtual reality (VR) and augmented reality (AR) applications also benefit greatly from 3D human models. In VR, users can interact with highly detailed and realistic human avatars, making the virtual experience more immersive and engaging. AR applications overlay 3D human models onto the real world, providing enhanced interactivity for educational, training, and entertainment purposes. The realism and interactivity provided by 3D human models are crucial for the effectiveness and appeal of VR and AR experiences.

Avatars, or digital representations of users, are another significant application of 3D content generation technology. Avatars are extensively used in social media, online communication, and virtual environments. Creating avatars involves face recognition and modeling, skeletal animation, and expression capture. These avatars can be customized to reflect the user’s appearance, personality, and preferences, providing a personalized and engaging digital presence. Social media platforms like Meta’s Horizon Worlds and Snapchat’s Bitmoji allow users to create and interact through avatars, enhancing social interactions and personal expression. In online education and training, avatars represent teachers and students, facilitating interactive and immersive learning experiences. Virtual meetings and conferences also employ avatars to create more engaging and lifelike interactions, improving communication and collaboration among remote participants.

The rapid evolution of artificial intelligence and machine learning has ushered in a transformative era for 3D modeling, with text-to-image prior-guided text-to-3D modeling at the forefront of this technological revolution. This innovative approach leverages the power of natural language processing to intelligently generate 3D models that correspond to textual descriptions, significantly expanding the horizons and possibilities of 3D content creation. Within this domain, a variety of more nuanced and specialized applications have emerged, including personalized avatar creation.

The application of text-to-avatar allows users to craft unique 3D avatars through simple textual input, reflecting not only their physical characteristics but also their personality and emotional states. Across social media platforms, gaming, and virtual reality environments, this technology offers a novel mode of self-expression, enhancing user interaction and immersive experiences.

This paper delves into text-guided 3D model generation models that are predicated on text-to-image priors, examining their underlying principles, technical frameworks, and practical effectiveness across various application scenarios. Through these investigations, we aim to provide new perspectives and insights into the field of 3D modeling, propelling continuous technological innovation and advancement.

Looking ahead, the integration of large-scale models and deep learning into 3D content generation holds great promise. AI can enhance the accuracy and efficiency of 3D modeling processes, enabling the creation of more realistic and complex models with less manual intervention. Deep learning techniques can improve facial recognition, expression capture, and animation, making avatars more lifelike and expressive.

II 3D Technologies

II-A 3D Representation

3D representation is vital in computer graphics, enabling the digital visualization and manipulation of three-dimensional objects and scenes. Different methods of 3D representation offer varying degrees of detail, efficiency, and application suitability. Here, we delve into several key explicit representations, including meshes, point clouds, and voxels.

Explicit Representations

Explicit representations define 3D objects with precise mathematical descriptions. This category includes mesh-based methods, which represent surfaces using vertices, edges, and faces. Meshes are widely used due to their flexibility and efficiency in rendering.

Meshes, Point Clouds and Voxels

A mesh is composed of polygons, typically triangles, which are connected by their edges and vertices to form the surface of a 3D object. Meshes are favored in applications requiring detailed and smooth surfaces, such as video games, animations, and simulations. They allow for efficient rendering[22, 23] and easy manipulation but can become complex when representing highly detailed surfaces.

Point clouds[24, 25] represent 3D shapes as collections of discrete points in space. Each point has specific (x, y, z) coordinates and often includes attributes like color or normal vectors. Point clouds are useful in scanning and reconstruction applications, as they capture surface geometry directly from real-world scans. They are employed in fields like autonomous driving, where LiDAR sensors generate point cloud data to represent the environment. However, point clouds can be sparse and may require significant processing to convert into other forms of representation, like meshes or voxel grids, for further use. The lack of explicit connectivity between points can complicate operations like surface reconstruction and rendering.

Voxels[26, 27, 28], or volumetric pixels, are the 3D equivalent of pixels in a 2D image. They divide the 3D space into a grid of small cubes, each storing information about the material present at that location. Voxel grids are suitable for applications requiring uniform space representation. They handle complex topologies and internal structures of objects well. However, their memory consumption is high, as high-resolution voxel grids can become very large. This makes them less efficient for detailed surface representations compared to meshes. Advancements in sparse voxel representations and efficient storage techniques are mitigating these issues.
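To make these representations concrete, the following minimal Python/NumPy sketch (illustrative only; the array layouts and the voxelize helper are our own naming, not drawn from any cited work) stores a tiny triangle mesh and point cloud and converts the point cloud into a boolean occupancy grid.

```python
import numpy as np

# Minimal illustrative containers for the three explicit representations.
# A triangle mesh: vertices (x, y, z) and faces indexing into the vertices.
vertices = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])  # a tetrahedron

# A point cloud: N points, optionally with per-point attributes such as color.
points = vertices + 0.01 * np.random.randn(*vertices.shape)
colors = np.random.rand(len(points), 3)

def voxelize(points, resolution=32):
    """Naive point-cloud-to-voxel conversion: mark every cell that contains a point."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    # Normalize points into [0, resolution) and clamp to integer grid indices.
    idx = ((points - mins) / (maxs - mins + 1e-8) * (resolution - 1)).astype(int)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

occupancy = voxelize(points)
print(occupancy.sum(), "occupied voxels out of", occupancy.size)
```

The memory trade-off discussed above is visible here: the dense boolean grid grows cubically with resolution even though only a handful of cells are occupied, which is exactly what sparse voxel structures aim to avoid.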

II-B NeRF and 3DGS

Neural Radiance Fields (NeRF)[29, 30, 31, 32] synthesize novel views of complex 3D scenes from sparse sets of 2D images. NeRF represents a scene using a continuous 5D function mapping spatial coordinates and viewing directions to color and density values. This enables photorealistic image creation from new viewpoints by interpolating between input images. NeRF employs a fully-connected neural network to optimize a volumetric scene function, accumulating colors and densities to form a 2D image. The process involves marching rays through the scene, sampling 3D points, and predicting color and density values using the neural network, allowing high-fidelity reconstructions with intricate details and realistic lighting. Classic NeRF has limitations, such as long training times and the need to train a new model for each scene. Extensions like InstantNeRF[33], which uses a multiresolution hash table to accelerate training, and PixelNeRF[34], which leverages convolutional neural networks for view synthesis from a single image, address these challenges.
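As an illustration of the rendering step NeRF relies on, the sketch below implements the standard discrete volume-rendering quadrature for one ray; the densities and colors are random stand-ins for the outputs of a trained MLP, so the numbers only show the mechanics.

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Discrete volume rendering used by NeRF:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i, with T_i the accumulated transmittance."""
    alphas = 1.0 - np.exp(-densities * deltas)                      # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # T_i = prod_{j<i} (1 - alpha_j)
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0), weights

# Toy example: 64 samples along one ray. In a real NeRF these densities and colors
# come from an MLP queried at the sample positions and viewing direction.
n = 64
densities = np.random.rand(n) * 2.0       # sigma_i >= 0
colors = np.random.rand(n, 3)             # c_i in [0, 1]
deltas = np.full(n, 4.0 / n)              # spacing between consecutive samples

pixel_rgb, weights = composite_ray(densities, colors, deltas)
print("rendered pixel:", pixel_rgb, "accumulated opacity:", weights.sum())
```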

3D Gaussian Splatting[35, 36, 9, 37, 38] is a sophisticated volumetric rendering technique used for creating and manipulating intricate visual effects within a three-dimensional space. This method involves “splatting” Gaussian distributions throughout the volume data to simulate the impact of light sources, material properties, and geometric shapes on a scene. Each Gaussian distribution, or “splat,” represents a minuscule volumetric element of the scene, encapsulating attributes such as color, density, and transparency. By blending these splats, 3D Gaussian Splatting can produce highly realistic imagery, making it a valuable tool for applications in computer graphics, visual effects, and scientific visualization. Potential applications range from virtual property tours and urban planning to creating photorealistic avatars for telepresence in VR environments.
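The following highly simplified sketch shows the core compositing idea for a single pixel, assuming the 3D Gaussians have already been projected to screen space and depth-sorted; real 3D Gaussian Splatting pipelines add covariance projection, tile-based rasterization, and differentiable optimization that are omitted here.

```python
import numpy as np

def splat_pixel(pixel_xy, centers, inv_covs, opacities, colors):
    """Blend depth-sorted 2D Gaussian splats at one pixel (front-to-back alpha compositing)."""
    rgb = np.zeros(3)
    transmittance = 1.0
    for mu, inv_cov, opac, col in zip(centers, inv_covs, opacities, colors):
        d = pixel_xy - mu
        # Gaussian falloff evaluated at the pixel, scaled by the splat's opacity.
        alpha = opac * np.exp(-0.5 * d @ inv_cov @ d)
        rgb += transmittance * alpha * col
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-3:          # early termination once the pixel is nearly opaque
            break
    return rgb

# Two toy splats already sorted by depth (nearest first).
centers = np.array([[10.0, 10.0], [12.0, 11.0]])
inv_covs = np.array([np.eye(2) / 4.0, np.eye(2) / 9.0])
opacities = np.array([0.8, 0.6])
colors = np.array([[1.0, 0.2, 0.2], [0.2, 0.2, 1.0]])
print(splat_pixel(np.array([11.0, 10.0]), centers, inv_covs, opacities, colors))
```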

II-C Diffusion Models

Diffusion models are generative models based on the idea of reversing a diffusion process, transforming data from a simple initial distribution to a complex target distribution. The diffusion process involves gradually adding noise to data, while the reverse process denoises it to generate new samples.

Forward Process (Diffusion)

The forward process can be described as a Markov chain where noise is added to the data at each time step. This is represented by the following equation:

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)

Here, $x_t$ is the data at time step $t$, $\beta_t$ is a variance schedule controlling the amount of noise added, and $\mathcal{N}$ denotes a Gaussian distribution.

The forward process starts from the original data $x_0$ and progressively adds noise:

q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)

where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.
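A minimal NumPy sketch of this closed-form forward process is given below; the linear beta schedule and the number of steps are illustrative choices, not prescribed by the equations above.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # variance schedule beta_t (illustrative)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)               # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) in closed form."""
    if noise is None:
        noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = np.random.randn(3, 32, 32)               # a toy "image"
xt = q_sample(x0, t=500)                      # heavily noised version of x0
```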

Reverse Process (Denoising)

The reverse process learns to denoise the data, moving from the noisy data $x_t$ back to the original data $x_0$. It is parameterized by a neural network $p_\theta$:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big)

Here, $\mu_\theta(x_t, t)$ is the mean predicted by the neural network, and $\sigma_t^2$ is typically fixed.

Training Objective

The training objective is to minimize the difference between the true reverse process and the model’s predictions. This can be formulated as:

L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\big[\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\,\big]

where $\epsilon$ is the noise added in the forward process and $\epsilon_\theta$ is the neural network's prediction of that noise.
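In code, one training step of this simplified objective can be sketched as follows; eps_model stands for any noise-prediction network taking (x_t, t), and no particular architecture is assumed.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, alpha_bars):
    """One step of the simplified DDPM objective: sample t and epsilon,
    form x_t in closed form, and regress the injected noise."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(b, *([1] * (x0.dim() - 1)))      # broadcast over spatial dims
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps        # closed-form forward sample
    eps_pred = eps_model(x_t, t)                                # epsilon_theta(x_t, t)
    return F.mse_loss(eps_pred, eps)
```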

Sampling

To generate new samples, the model starts from random noise $x_T$ and iteratively applies the reverse process:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z

where $z \sim \mathcal{N}(0, \mathbf{I})$ is Gaussian noise.
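The sampling loop below follows this update directly; setting sigma_t = sqrt(beta_t) is one common choice, and eps_model again denotes an already-trained noise-prediction network.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas):
    """Ancestral sampling: start from x_T ~ N(0, I) and iterate the reverse update above."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                    # x_T
    for t in reversed(range(len(betas))):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, torch.full((shape[0],), t))        # epsilon_theta(x_t, t)
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = x + torch.sqrt(betas[t]) * z                      # sigma_t = sqrt(beta_t) choice
    return x
```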

In summary, diffusion models leverage a forward process that adds noise to the data and a learned reverse process that removes noise, enabling the generation of complex data distributions from simple initial noise.

[Figures 1–8 omitted: qualitative results of DreamAvatar (Figs. 1–3), DreamFace (Figs. 4–6), and AvatarCraft (Figs. 7–8), as referenced in Section IV.]

III Avatar

Text-to-3D[40, 10, 41, 11, 42, 43, 44, 45, 46, 47, 48] is an emerging interdisciplinary field that focuses on converting textual descriptions into three-dimensional models. This technology has a wide range of applications, including virtual reality, gaming, and design.

In recent years, the creation of 3D graphical human models has drawn considerable attention due to its extensive applications in areas such as movie production, video gaming, AR/VR, and human-computer interaction. Creating 3D avatars directly from natural language could save substantial production resources and holds great research promise. 3D human generation is a complex and evolving field in computer graphics and artificial intelligence, aiming to create detailed, animatable human models with realistic movements and non-rigid components such as hair and clothing. Work in this area can be broadly divided into two categories: generating the full 3D human body and generating the 3D avatar head. Below, we discuss significant contributions and methodologies in these domains.

The generation of 3D human avatars involves creating diverse virtual humans with different identities, shapes, and poses. This task is particularly challenging due to the variety in clothed body shapes and their complex articulations. Additionally, 3D avatars should be animatable, making the task even more difficult as it requires creating non-rigid components.

Early work in 3D human mesh generation[49, 50, 51] utilized human parametric models. These models represent human body shapes using a small set of parameters that deform a template mesh. While these models facilitate easy synthesis of human shapes by adjusting a few parameters, they often fail to capture finer details due to the fixed mesh topology and limited resolution.

To address these limitations, more recent methods have employed implicit non-rigid representations. For instance, SNARF[52] generates detailed 3D human avatars by predicting skinning fields and surface normals. gDNA[53] extends this approach by predicting a skinning field and a normal field with an MLP trained on raw posed 3D scans. AvatarGen[54] leverages an SDF based on an SMPL prior to provide disentangled control over geometry and appearance, while GNARF[55] utilizes generative and compositional NeRFs for high-resolution rendering and efficient training. Another line of work focuses on creating animatable 3D avatars from 3D scans. Traditional methods often used linear blend skinning, which is simple but cannot produce pose-dependent variations, leading to artifacts. To overcome this, techniques like dual quaternion blend skinning[56] and multi-weight enveloping[57] have been developed.
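For reference, the classic linear blend skinning that these methods improve upon can be sketched in a few lines; the toy vertices, weights, and bone transforms below are illustrative.

```python
import numpy as np

def linear_blend_skinning(rest_vertices, skin_weights, bone_transforms):
    """Classic LBS: each posed vertex is a weighted sum of its rest position
    transformed by the bones that influence it, v' = sum_k w_k * (R_k v + t_k)."""
    n_verts = rest_vertices.shape[0]
    homo = np.concatenate([rest_vertices, np.ones((n_verts, 1))], axis=1)   # homogeneous coords (N, 4)
    posed = np.zeros_like(rest_vertices)
    for k, T in enumerate(bone_transforms):                                 # T is a 4x4 bone transform
        posed += skin_weights[:, k:k + 1] * (homo @ T.T)[:, :3]
    return posed

# Toy example: 2 vertices, 2 bones; the second bone translates along x.
rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5]])                 # per-vertex weights, rows sum to 1
identity = np.eye(4)
shift_x = np.eye(4)
shift_x[0, 3] = 1.0
print(linear_blend_skinning(rest, weights, [identity, shift_x]))
```

Because the blend is purely linear in the transforms, joints can collapse or "candy-wrap" under large rotations, which is the artifact the dual quaternion and multi-weight approaches cited above are designed to reduce.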

Recently, implicit representations have become popular for generating animatable 3D avatars from scans due to their resolution independence and smoothness. NASA[58] uses an occupancy field to model humans as deformable components. Several methods[52, 59] explore occupancy and SDF-based representations for animatable human characters.

Further advancements include deformable and animatable NeRFs, such as those proposed by previous works[60, 61, 62], which synthesize humans from novel poses and views.

Generating 3D avatar heads requires fine-grained control over facial expressions and realistic modeling of complex components like hair. Traditional methods like 3D Morphable Models[63] simplify face modeling by fitting coefficients in a linear subspace; variants include multilinear models[64] and fully articulated head models[65]. Deep generative methods[66] learn face models from in-the-wild images, and recent work has also focused on editing avatar faces. For example, IDE-3D[67] allows interactive editing of shape and texture.
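The linear formulation at the heart of such morphable models can be sketched as follows; the random basis is a stand-in for the PCA basis a real 3DMM learns from registered scans, and texture is handled analogously.

```python
import numpy as np

def morphable_face(mean_shape, shape_basis, shape_coeffs):
    """Linear 3D morphable model: a face is the mean shape plus a coefficient-weighted
    combination of learned basis vectors, S = S_mean + sum_i alpha_i * B_i."""
    return mean_shape + shape_basis @ shape_coeffs

# Toy dimensions: 5,000 vertices (flattened xyz) and 50 shape components.
n_coords, n_components = 5000 * 3, 50
mean_shape = np.zeros(n_coords)
shape_basis = np.random.randn(n_coords, n_components) * 0.01   # stand-in for a learned PCA basis
alphas = np.random.randn(n_components)                         # identity coefficients
face = morphable_face(mean_shape, shape_basis, alphas)
print(face.shape)                                               # (15000,) flattened vertex positions
```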

The generation of 3D human bodies and heads involves a variety of techniques ranging from parametric models and implicit representations to advanced neural network-based methods. These approaches strive to create detailed, realistic, and animatable 3D human models, addressing challenges like capturing fine details, managing non-rigid components, and ensuring realistic motion generation.

IV Text-to-3D avatar generation

IV-A DreamAvatar

DreamAvatar[39] is a cutting-edge 3D human avatar generation framework. As shown in Figs. 1, 2, and 3, this framework employs a trainable Neural Radiance Field (NeRF) model to predict density and color for 3D points, complemented by pretrained text-to-image diffusion models that offer 2D self-supervision. The key innovations of DreamAvatar are threefold:

Utilization of the SMPL Model:

By leveraging the Skinned Multi-Person Linear (SMPL) model, DreamAvatar provides robust shape and pose guidance, facilitating the generation process to produce 3D models that adhere to human body structures.

Dual-Observation Space Design:

The framework introduces a novel design that incorporates two observation spaces—a canonical space and a posed space—related by a learnable deformation field, aiding in the generation of more comprehensive textures and geometries that are faithful to the target pose.

Joint Optimization of Loss Functions:

DreamAvatar optimizes not only from the perspective of the full body but also from a zoomed-in 3D head viewpoint, addressing the multi-face “Janus” problem and enhancing facial details in the generated avatars.

The authors detail the implementation of the method, including the use of threestudio for NeRF and Variational Score Distillation (VSD), as well as the specific versions of Stable Diffusion and ControlNet employed. Extensive evaluations on various text prompts have demonstrated DreamAvatar's effectiveness, outperforming existing text-guided 3D generation methods by producing 3D human avatars with high-quality geometry and consistent textures. Additionally, the paper presents user studies and further analyses that substantiate DreamAvatar's leading position in 3D avatar generation. It also discusses the limitations of the current implementation, such as the lack of consideration for animation and potential biases inherited from the pretrained diffusion model. Finally, the paper considers the societal impact of this technology, highlighting both the positive contributions to metaverse development and the risks associated with malicious use.
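The 2D self-supervision underlying this family of methods can be illustrated with a Score Distillation Sampling style gradient, sketched below under several assumptions: diffusion_eps, text_embedding, and weight_fn are placeholder names, and DreamAvatar itself uses the more elaborate VSD objective with shape and pose conditioning rather than this bare form.

```python
import torch

def sds_loss(rendered_image, diffusion_eps, text_embedding, alpha_bars, weight_fn=lambda t: 1.0):
    """Score Distillation Sampling style guidance: noise a rendering of the current 3D model,
    ask a frozen text-conditioned diffusion model to predict that noise, and push the rendering
    toward images the diffusion prior finds likely for the prompt.
    grad ~= w(t) * (epsilon_theta(x_t; y, t) - epsilon), backpropagated through the renderer."""
    t = torch.randint(20, len(alpha_bars) - 20, (1,)).item()     # avoid extreme timesteps
    eps = torch.randn_like(rendered_image)
    a_bar = alpha_bars[t]
    x_t = a_bar.sqrt() * rendered_image + (1 - a_bar).sqrt() * eps
    with torch.no_grad():
        eps_pred = diffusion_eps(x_t, t, text_embedding)         # frozen pretrained diffusion prior
    grad = weight_fn(t) * (eps_pred - eps)
    # Surrogate loss whose gradient w.r.t. the rendering equals `grad`; optimizing it
    # updates the NeRF/avatar parameters through the differentiable renderer.
    return (grad.detach() * rendered_image).sum()
```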

IV-B DreamFace

DreamFace[19] is a pioneering framework that represents a significant stride in the generation of 3D facial assets. As shown in Figs. 4, 5, and 6, this technology stands out for its ability to transform textual prompts into lifelike and customizable 3D faces with intricate details and animations. The methodology underpinning DreamFace is a testament to its innovative approach, combining a coarse-to-fine geometry generation strategy with a dual-path mechanism that employs both a generic Latent Diffusion Model (LDM) and a specialized texture LDM. This dual-path mechanism is pivotal, as it ensures the generation of facial textures that are not only diverse but also adhere to the specific requirements of the UV space.

The framework’s two-stage optimization process is another keystone of its success, allowing for efficient and fine-grained synthesis. This process is carried out in both the latent and image spaces, which is crucial for mapping the compact latent space to physically-based textures. The result is a seamless integration of high-quality textures that are essential for photo-realistic rendering.

Moreover, DreamFace’s animation capabilities are enhanced through the support of blendshapes and the introduction of a cross-identity hypernetwork. This empowers the generation of personalized and nuanced facial expressions, enabling the creation of dynamic and engaging characters. The framework’s compatibility with existing computer graphics pipelines and its ability to animate using in-the-wild video footage significantly democratize the creation of 3D facial assets, making it accessible to both professionals and novices alike.

DreamFace’s methodological sophistication lies in its multi-stage generation process that progressively refines the facial geometry, meticulously applies physically-based textures, and enriches the asset with animatable features. This comprehensive approach positions DreamFace at the forefront of 3D facial asset generation technology, offering a versatile solution for a myriad of applications across the digital content creation spectrum.

IV-C AvatarCraft

As shown in Figs. 7 and 8, AvatarCraft[20] is an innovative approach to transforming text descriptions into high-quality 3D human avatars with controllable poses and shapes. This method leverages diffusion models to guide the learning of geometry and texture for a neural avatar based on a single text prompt.

AvatarCraft begins with a basic neural human avatar template. Given a text prompt, it uses diffusion models to update the template with geometry and texture that align with the text description. To address the challenge of distorted geometry and textures that can arise from direct application of diffusion models, the team designed a novel coarse-to-fine multi-bounding-box training strategy. This strategy improves global style consistency while preserving fine details.

A key aspect of AvatarCraft is shape regularization, introduced to stabilize the optimization process by penalizing the accumulated ray opacity of the avatar. This approach helps to prevent the generation of degenerate solutions such as empty volumes or flat geometry.
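A possible form of such an opacity penalty is sketched below; the quadratic penalty and the target value are our own illustrative choices, and AvatarCraft's exact regularizer may differ.

```python
import torch

def opacity_regularizer(sample_densities, sample_deltas, target=0.7):
    """Illustrative shape regularizer in the spirit described above: compute the accumulated
    opacity of each ray from its per-sample densities and penalize deviation from a target,
    discouraging degenerate solutions such as empty volumes (opacity near 0 everywhere)."""
    alphas = 1.0 - torch.exp(-sample_densities * sample_deltas)      # (rays, samples)
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas[:, :-1]], dim=1), dim=1)
    accumulated = (transmittance * alphas).sum(dim=1)                # per-ray opacity in [0, 1]
    return ((accumulated - target) ** 2).mean()
```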

For animation capabilities, AvatarCraft employs a deformation technique that uses an explicit warping field. This field maps the target human mesh to a template human mesh, with both meshes represented using parametric human models. This method simplifies the animation and reshaping of the generated avatar by controlling pose and shape parameters, without the need for additional training.

The AvatarCraft framework also introduces a coarse-to-fine generation strategy that captures style details at different scales. This strategy is critical for generating avatars with important visual features correctly aligned, as it ensures that finer texture details are generated as the resolution increases.

In terms of implementation, the team combined the neural implicit field architecture of NeuS with the fast training approach of Instant-NGP. This combination improves reconstruction speed while maintaining quality. The avatar creation process involves training on multi-view renderings of a bare SMPL mesh, using the Score Distillation Sampling (SDS) loss from DreamFusion for guidance.

Extensive experiments using various text descriptions demonstrate that AvatarCraft is effective and robust in creating human avatars and rendering novel views, poses, and shapes. The project’s webpage provides a more immersive view into the 3D results generated by the method.

V Applications

The development of advanced techniques for 3D human and avatar generation has opened up a myriad of applications across various industries. These applications leverage the ability to create detailed, animatable, and realistic virtual humans for purposes ranging from entertainment to healthcare. Below, we explore several key areas where 3D human and avatar technologies are making a significant impact.

V-A Entertainment and Media

One of the most prominent applications of 3D human and avatar generation is in the entertainment and media industry. Virtual humans are used extensively in video games, movies, and virtual reality (VR) experiences. In video games, 3D avatars provide players with customizable characters that enhance immersion and personal engagement. Games like “The Sims” and “Cyberpunk 2077” showcase highly detailed and customizable avatars, contributing to a richer player experience.

In the film industry, 3D avatars allow for the creation of realistic digital doubles of actors. These digital doubles can perform stunts, de-age actors, or even resurrect deceased performers, as seen with characters like Princess Leia in “Rogue One: A Star Wars Story.” Moreover, fully animated movies like “Avatar” and “Toy Story” rely heavily on 3D character generation to bring fantastical worlds and characters to life.

Virtual reality and augmented reality (AR) also benefit from 3D avatars. In VR, users can interact with fully immersive environments where avatars represent themselves and others, enhancing social presence and interaction. In AR, applications like Snapchat filters and virtual try-on for fashion use 3D avatars to overlay virtual objects onto the real world.

V-B Social Media and Communication

Social media platforms and communication tools are increasingly incorporating 3D avatars to enhance user interaction. Platforms like Facebook and Snapchat allow users to create personalized avatars, which can be used in posts, stories, and as virtual representations in online meetings. These avatars add a layer of personalization and expression, enabling users to convey emotions and personalities more effectively than through text or 2D images alone.

In communication, 3D avatars are used in virtual meetings and telepresence systems. Companies like Spatial and Zoom are developing technologies that allow users to join meetings as 3D avatars, creating a more engaging and interactive experience compared to traditional video calls. These avatars can mimic the user’s gestures and facial expressions, making remote communication feel more natural and connected.

V-C E-commerce and Retail

In the e-commerce and retail sectors, 3D avatars are revolutionizing the way consumers shop online. Virtual try-on solutions allow customers to see how clothes, accessories, and even cosmetics look on a 3D model of themselves before making a purchase. This technology improves the shopping experience by helping consumers make more informed decisions and reducing the rate of returns due to sizing and fit issues.

For example, companies like Warby Parker and Sephora use augmented reality to let users try on glasses and makeup virtually. Similarly, fashion retailers are developing virtual fitting rooms where customers can upload a 3D scan of their body to try on clothing. This not only enhances the shopping experience but also allows for more personalized recommendations and a deeper connection between brands and consumers.

V-D Healthcare and Rehabilitation

The healthcare industry is leveraging 3D human and avatar technologies for various applications, including medical training, patient care, and rehabilitation. In medical training, realistic 3D avatars can simulate complex surgical procedures, allowing students and professionals to practice in a controlled, risk-free environment. These simulations can include detailed anatomy and realistic physiological responses, providing a valuable learning tool.

For patient care, 3D avatars are used in telemedicine to facilitate remote consultations. Doctors can use avatars to explain medical conditions and treatments in a more visual and understandable way. In rehabilitation, virtual avatars are used in therapeutic exercises, helping patients recover from injuries or surgeries. For instance, patients can perform physical therapy exercises guided by a virtual avatar that demonstrates the correct movements and provides feedback.

V-E Fitness and Wellness

3D avatars are also finding applications in the fitness and wellness industry. Virtual fitness trainers use avatars to guide users through workout routines, providing a more engaging and personalized experience. These trainers can adapt workouts based on the user’s performance and goals, offering real-time feedback and motivation.

Moreover, wellness apps are incorporating avatars to help users visualize their progress in areas like weight loss or muscle gain. By creating a 3D model that evolves with the user’s physical changes, these apps provide a tangible representation of progress, which can be highly motivating.

V-F Education and Training

In the field of education and training, 3D avatars are used to create immersive learning environments. Virtual classrooms and training simulations can incorporate avatars to represent students and instructors, facilitating interactive and collaborative learning experiences. For example, language learning apps can use avatars to create conversational scenarios where learners practice speaking with virtual characters.

Corporate training programs also use 3D avatars to simulate real-world scenarios, such as customer service interactions or emergency response drills. These simulations provide employees with hands-on experience in a safe and controlled setting, improving their skills and confidence.

V-G Fashion and Personalization

The fashion industry is leveraging 3D avatar technology to streamline design processes and enhance personalization. Designers use 3D avatars to create virtual prototypes of clothing, allowing for rapid iteration and visualization of designs before physical production. This not only speeds up the design process but also reduces waste by minimizing the need for physical samples.

For consumers, 3D avatars enable a new level of personalization. Custom clothing brands can use avatars to create perfectly fitted garments based on a 3D scan of the customer’s body. This technology ensures a better fit and more satisfaction with the final product.

V-H Cultural Heritage and Preservation

In the field of cultural heritage, 3D avatars are used to recreate historical figures and scenes. Museums and educational institutions use these avatars to bring history to life, providing visitors with interactive and engaging experiences. Virtual tours can feature avatars of historical figures who guide visitors through exhibits, sharing stories and insights.

Moreover, 3D avatars are used in the preservation of cultural heritage sites. By creating detailed 3D models of sites and artifacts, researchers can study and preserve these cultural treasures for future generations. These models can be used for virtual reconstructions, allowing people to explore historical sites in their original form.

The applications of 3D human and avatar generation are vast and diverse, impacting numerous industries and aspects of daily life. From enhancing entertainment and social media experiences to revolutionizing e-commerce, healthcare, and education, the ability to create realistic and animatable 3D humans offers significant benefits. As technology continues to advance, we can expect even more innovative uses of 3D avatars, further integrating them into our digital and physical worlds.

VI Conclusion

In conclusion, the integration of AI with 3D human model and avatar generation has revolutionized digital content creation, offering higher quality and broader access. This paper has emphasized personalized avatar creation and the significant applications of the technology across industries. While the technology promises substantial benefits, it also presents challenges that must be managed responsibly. As these advancements continue, they are set to reshape our digital experiences and interactions.

References

  • [1] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
  • [2] P. Li, Q. Huang, Y. Ding, and Z. Li, “Layerdiffusion: Layered controlled image editing with diffusion models,” in SIGGRAPH Asia 2023 Technical Communications, 2023, pp. 1–4.
  • [3] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-prompt image editing with cross attention control,” arXiv preprint arXiv:2208.01626, 2022.
  • [4] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510.
  • [5] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint arXiv:2210.02303, 2022.
  • [6] T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18392–18402.
  • [7] G. Couairon, J. Verbeek, H. Schwenk, and M. Cord, “Diffedit: Diffusion-based semantic image editing with mask guidance,” arXiv preprint arXiv:2210.11427, 2022.
  • [8] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International Conference on Machine Learning. PMLR, 2021, pp. 8821–8831.
  • [9] J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,” arXiv preprint arXiv:2309.16653, 2023.
  • [10] P. Li, C. Tang, Q. Huang, and Z. Li, “Art3d: 3d gaussian splatting for text-guided artistic scenes generation,” arXiv preprint arXiv:2405.10508, 2024.
  • [11] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin, “Magic3d: High-resolution text-to-3d content creation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 300–309.
  • [12] Z. Xiong, D. Kang, D. Jin, W. Chen, L. Bao, S. Cui, and X. Han, “Get3dhuman: Lifting stylegan-human into a 3d generative model using pixel-aligned reconstruction priors,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9287–9297.
  • [13] R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, L. Chen, C. Zeng, and H. Su, “Zero123++: A single image to consistent multi-view diffusion base model,” arXiv preprint arXiv:2310.15110, 2023.
  • [14] A. Raj, S. Kaza, B. Poole, M. Niemeyer, N. Ruiz, B. Mildenhall, S. Zada, K. Aberman, M. Rubinstein, J. Barron et al., “Dreambooth3d: Subject-driven text-to-3d generation,” arXiv preprint arXiv:2303.13508, 2023.
  • [15] J. Sun, B. Zhang, R. Shao, L. Wang, W. Liu, Z. Xie, and Y. Liu, “Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior,” arXiv preprint arXiv:2310.16818, 2023.
  • [16] P. Li, B. Li, and Z. Li, “Sketch-to-Architecture: Generative AI-aided architectural design,” in Pacific Graphics Short Papers and Posters. The Eurographics Association, 2023.
  • [17] P. Li and Z. Li, “Efficient temporal denoising for improved depth map applications,” in Proc. Int. Conf. Learn. Representations, Tiny Papers, 2023.
  • [18] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1623–1637, 2020.
  • [19] L. Zhang, Q. Qiu, H. Lin, Q. Zhang, C. Shi, W. Yang, Y. Shi, S. Yang, L. Xu, and J. Yu, “DreamFace: Progressive generation of animatable 3d faces under text guidance,” ACM Trans. Graph., 2023.
  • [20] R. Jiang, C. Wang, J. Zhang, M. Chai, M. He, D. Chen, and J. Liao, “Avatarcraft: Transforming text into neural human avatars with parameterized shape and pose control,” arXiv preprint arXiv:2303.17606, 2023.
  • [21] C.-Y. Weng, B. Curless, P. P. Srinivasan, J. T. Barron, and I. Kemelmacher-Shlizerman, “Humannerf: Free-viewpoint rendering of moving people from monocular video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16210–16220.
  • [22] M. Botsch, L. Kobbelt, M. Pauly, P. Alliez, and B. Lévy, Polygon Mesh Processing. CRC Press, 2010.
  • [23] L. A. Shirman and C. H. Sequin, “Local surface interpolation with bézier patches,” Computer Aided Geometric Design, vol. 4, no. 4, pp. 279–295, 1987.
  • [24] K.-A. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, and V. Lempitsky, “Neural point-based graphics,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII. Springer, 2020, pp. 696–712.
  • [25] G. Kopanas, J. Philip, T. Leimkühler, and G. Drettakis, “Point-based neural rendering with per-view optimization,” in Computer Graphics Forum, vol. 40, no. 4. Wiley Online Library, 2021, pp. 29–43.
  • [26] L. Liu, J. Gu, K. Zaw Lin, T.-S. Chua, and C. Theobalt, “Neural sparse voxel fields,” Advances in Neural Information Processing Systems, vol. 33, pp. 15651–15663, 2020.
  • [27] C. Sun, M. Sun, and H.-T. Chen, “Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5459–5469.
  • [28] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 922–928.
  • [29] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European Conference on Computer Vision. Springer, 2020, pp. 405–421.
  • [30] M. Niemeyer and A. Geiger, “GIRAFFE: Representing scenes as compositional generative neural feature fields,” in CVPR, 2021.
  • [31] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, “Mip-nerf 360: Unbounded anti-aliased neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5470–5479.
  • [32] A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su, “MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14124–14133.
  • [33] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics (TOG), vol. 41, no. 4, pp. 1–15, 2022.
  • [34] A. Yu, V. Ye, M. Tancik, and A. Kanazawa, “pixelnerf: Neural radiance fields from one or few images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4578–4587.
  • [35] R. Abdal, W. Yifan, Z. Shi, Y. Xu, R. Po, Z. Kuang, Q. Chen, D.-Y. Yeung, and G. Wetzstein, “Gaussian shell maps for efficient 3d human generation,” arXiv preprint arXiv:2311.17857, 2023.
  • [36] Z. Chen, F. Wang, and H. Liu, “Text-to-3d using gaussian splatting,” arXiv preprint arXiv:2309.16585, 2023.
  • [37] X. Liu, X. Zhan, J. Tang, Y. Shan, G. Zeng, D. Lin, X. Liu, and Z. Liu, “Humangaussian: Text-driven 3d human generation with gaussian splatting,” arXiv preprint arXiv:2311.17061, 2023.
  • [38] T. Yi, J. Fang, G. Wu, L. Xie, X. Zhang, W. Liu, Q. Tian, and X. Wang, “Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors,” arXiv preprint arXiv:2310.08529, 2023.
  • [39] Y. Cao, Y.-P. Cao, K. Han, Y. Shan, and K.-Y. K. Wong, “Dreamavatar: Text-and-shape guided 3d human avatar generation via diffusion models,” arXiv preprint arXiv:2304.00916, 2023.
  • [40] L. G. Foo, H. Rahmani, and J. Liu, “Ai-generated content (aigc) for various data modalities: A survey,” arXiv preprint arXiv:2308.14177, vol. 2, 2023.
  • [41] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “DreamFusion: Text-to-3d using 2d diffusion,” in Int. Conf. Learn. Represent., 2023.
  • [42] P. Li and B. Li, “Generating daylight-driven architectural design via diffusion models,” arXiv preprint arXiv:2404.13353, 2024.
  • [43] E. R. Chan, K. Nagano, M. A. Chan, A. W. Bergman, J. J. Park, A. Levy, M. Aittala, S. De Mello, T. Karras, and G. Wetzstein, “Generative novel view synthesis with 3d-aware diffusion models,” arXiv preprint arXiv:2304.02602, 2023.
  • [44] J. Tang, T. Wang, B. Zhang, T. Zhang, R. Yi, L. Ma, and D. Chen, “Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior,” arXiv preprint arXiv:2303.14184, 2023.
  • [45] T. Shen, J. Gao, K. Yin, M.-Y. Liu, and S. Fidler, “Deep marching tetrahedra: A hybrid representation for high-resolution 3d shape synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 6087–6101, 2021.
  • [46] Z. Wang, C. Lu, Y. Wang, F. Bao, C. Li, H. Su, and J. Zhu, “Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation,” arXiv preprint arXiv:2305.16213, 2023.
  • [47] W. Li, R. Chen, X. Chen, and P. Tan, “Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d,” arXiv preprint arXiv:2310.02596, 2023.
  • [48] G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12663–12673.
  • [49] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, “Scape: Shape completion and animation of people,” in ACM SIGGRAPH 2005 Papers, 2005, pp. 408–416.
  • [50] H. Joo, T. Simon, and Y. Sheikh, “Total capture: A 3d deformation model for tracking faces, hands, and bodies,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8320–8329.
  • [51] A. A. Osman, T. Bolkart, and M. J. Black, “Star: Sparse trained articulated human body regressor,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI. Springer, 2020, pp. 598–613.
  • [52] X. Chen, Y. Zheng, M. J. Black, O. Hilliges, and A. Geiger, “Snarf: Differentiable forward skinning for animating non-rigid neural implicit shapes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11594–11604.
  • [53] X. Chen, T. Jiang, J. Song, J. Yang, M. J. Black, A. Geiger, and O. Hilliges, “gdna: Towards generative detailed neural avatars,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20427–20437.
  • [54] J. Zhang, Z. Jiang, D. Yang, H. Xu, Y. Shi, G. Song, Z. Xu, X. Wang, and J. Feng, “Avatargen: A 3d generative model for animatable human avatars,” in European Conference on Computer Vision. Springer, 2022, pp. 668–685.
  • [55] A. Bergman, P. Kellnhofer, W. Yifan, E. Chan, D. Lindell, and G. Wetzstein, “Generative neural articulated radiance fields,” Advances in Neural Information Processing Systems, vol. 35, pp. 19900–19916, 2022.
  • [56] L. Kavan, S. Collins, J. Žára, and C. O’Sullivan, “Geometric skinning with approximate dual quaternion blending,” ACM Transactions on Graphics (TOG), vol. 27, no. 4, pp. 1–23, 2008.
  • [57] X. C. Wang and C. Phillips, “Multi-weight enveloping: Least-squares approximation techniques for skin animation,” in Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2002, pp. 129–138.
  • [58] B. Deng, J. P. Lewis, T. Jeruzalski, G. Pons-Moll, G. Hinton, M. Norouzi, and A. Tagliasacchi, “Nasa neural articulated shape approximation,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII. Springer, 2020, pp. 612–628.
  • [59] G. Tiwari, N. Sarafianos, T. Tung, and G. Pons-Moll, “Neural-gif: Neural generalized implicit functions for animating people in clothing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11708–11718.
  • [60] S. Peng, J. Dong, Q. Wang, S. Zhang, Q. Shuai, X. Zhou, and H. Bao, “Animatable neural radiance fields for modeling dynamic human bodies,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14314–14323.
  • [61] L. Liu, M. Habermann, V. Rudnev, K. Sarkar, J. Gu, and C. Theobalt, “Neural actor: Neural free-view synthesis of human actors with pose control,” ACM Transactions on Graphics (TOG), vol. 40, no. 6, pp. 1–16, 2021.
  • [62] Z. Zheng, H. Huang, T. Yu, H. Zhang, Y. Guo, and Y. Liu, “Structured local radiance fields for human avatar modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15893–15903.
  • [63] V. Blanz and T. Vetter, “A morphable model for the synthesis of 3d faces,” in Seminal Graphics Papers: Pushing the Boundaries, Volume 2, 2023, pp. 157–164.
  • [64] D. Vlasic, M. Brand, H. Pfister, and J. Popovic, “Face transfer with multilinear models,” in ACM SIGGRAPH 2006 Courses, 2006, pp. 24–es.
  • [65] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, “A 3d face model for pose and illumination invariant face recognition,” in 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2009, pp. 296–301.
  • [66] L. Tran and X. Liu, “On learning 3d face morphable model from in-the-wild images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 157–171, 2019.
  • [67] J. Sun, X. Wang, Y. Shi, L. Wang, J. Wang, and Y. Liu, “IDE-3D: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis,” ACM Trans. Graph., 2022.