AI Art Critic Documentation
In my VIP research course, I developed an AI Art Critic, a system designed to provide constructive feedback on both the technical and emotional aspects of artwork. Initially, the project centered on human-computer interaction, allowing artists to refine their work through a text-based chatbot that delivered stylistic critiques. While designing the chatbot’s personality (turning it into a snobbish art critic) and refining the user experience were enjoyable, I soon ran into a fundamental limitation: textual feedback alone was insufficient for technical corrections in areas like anatomy, perspective, and proportion.
This realization led me to geometry processing and deep learning-based 3D reconstruction, aiming to enhance the AI’s ability to provide precise, visual corrections, similar to how an art teacher might guide a student’s work.
Initial Challenge
A major challenge was enabling the AI to analyze and correct artistic errors visually rather than relying solely on text-based critique. One of the biggest issues was that objects in an image are inherently distorted by perspective and camera projection, making it difficult to train a deep learning model without a well-defined ground truth. This raised a fundamental question: how can an AI predict the “correct” perspective when the submitted artwork is already incorrect? Since the goal was to correct technical aspects of an artwork, I assumed that submitted images would be photo-realistic or at least follow realistic perspective principles.

To tackle this, I first tried to understand what it really means to “see” in the context of AI and artistic critique. I realized that the key to vision and realistic artwork lies in perspective: how objects relate to one another in 3D space and how they project onto a 2D canvas. This led me to study perspective transformations mathematically, experimenting with matrix representations of distortions using PyTorch and other computational tools.
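To make the matrix view of perspective concrete, here is a minimal sketch of a 3×3 homography warping 2D points, written in NumPy for brevity (the same operations carry over directly to PyTorch tensors). The square and matrix below are illustrative values, not data from the project:

```python
import numpy as np

def apply_homography(H, points):
    """Warp Nx2 points by a 3x3 projective matrix, with perspective divide."""
    homog = np.hstack([points, np.ones((len(points), 1))])  # to homogeneous coords
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                   # divide by w

# Corners of a unit square seen head-on...
square = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])

# ...and a simple perspective matrix that foreshortens along y,
# the kind of distortion a tilted viewpoint introduces.
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.5, 1.0]])

warped = apply_homography(H, square)
# The far edge (y = 1) is pulled inward: [1, 1] maps to about [0.667, 0.667].
```

Because the distortion is just a matrix, fitting or inverting it becomes an optimization problem over nine parameters, which is what makes this representation convenient to experiment with in PyTorch.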
A technique like this could help identify inconsistencies in a drawing, such as detecting when one part of an object does not align with the overall perspective of the scene. However, this alone did not fully address the issue, so I focused on 2D-to-3D reconstruction as a more fundamental solution to ensure accurate spatial understanding and correction.
Why 3D Reconstruction Matters for Artistic Feedback
The reason I focus so much on 3D reconstruction is that it provides context to determine what needs correction in a user-submitted drawing. The key issue is not just detecting mistakes but understanding the scene itself: What is the AI looking at? At what angle is it being viewed? What would the correct version look like in that same perspective? Without this contextual knowledge, 2D corrections alone are unreliable, as they lack an understanding of depth, proportions, and viewpoint distortions.
Proposal
To achieve accurate artistic corrections, I came up with a dual 3D reconstruction approach, where both the user’s submission and an idealized reference model are reconstructed in 3D space before being compared in a consistent viewpoint:
- User’s 3D Reconstruction – Convert the submitted drawing into a 3D mesh, reconstructing how it currently appears. This reconstruction is not necessarily used for final rendering but serves as a guide for extracting essential attributes such as camera pose, detected objects, and key object features like position and dominant colors. If the drawing contains perspective inconsistencies, the AI can determine the most dominant perspective from the drawing’s structure and inferred depth cues, ensuring that the reconstruction aligns with the artist’s intended viewpoint rather than being skewed by minor distortions. Understanding these elements allows the AI to analyze the image with proper spatial awareness.
- Idealized 3D Reconstruction – Generate a reference 3D model that captures only the key features of how the object should look. It uses only a subset of the information in the image: the relative camera pose, the detected objects, and their key features. This generation process can leverage existing 3D shape databases or deep learning techniques such as shape retrieval, latent space interpolation (e.g., ShapeNet, Latent NeRF), and neural style transfer for color approximation. The goal is to create an optimized 3D representation that serves as a benchmark for comparison.
- Pose Matching & Image Recapture – Render both models from the exact same camera angle as the user’s original drawing.
- Comparative Analysis – Compare the two reconstructed 2D images to detect deviations in form, perspective, and proportions.
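The pose matching and recapture steps can be sketched with a pinhole-camera projection: once a camera pose is fixed, both reconstructions are projected with the same parameters, so any remaining 2D disagreement reflects errors in the drawing rather than viewpoint differences. This is a toy NumPy sketch; the intrinsics, pose, and geometry below are stand-in values, where in the actual pipeline they would be estimated from the user's drawing:

```python
import numpy as np

def project(points_3d, K, R, t):
    """Pinhole projection of Nx3 world points to Nx2 pixel coordinates."""
    cam = points_3d @ R.T + t      # world frame -> camera frame
    px = cam @ K.T                 # apply camera intrinsics
    return px[:, :2] / px[:, 2:3]  # perspective divide

# Hypothetical toy geometry: four corners of a planar face, with the
# "user" version containing one misplaced vertex (a drawing error).
ideal = np.array([[0., 0., 5.], [1., 0., 5.], [1., 1., 5.], [0., 1., 5.]])
user = ideal.copy()
user[2] += [0.3, 0.2, 0.0]

# Illustrative intrinsics and pose, shared by both projections.
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)

ideal_px = project(ideal, K, R, t)
user_px = project(user, K, R, t)
per_vertex_error = np.linalg.norm(user_px - ideal_px, axis=1)
# Only the misplaced vertex shows a large reprojection error.
```

Rendering full meshes rather than points works the same way in principle; the essential point is that both models pass through one shared camera.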
By aligning the camera pose of the idealized and user-submitted reconstructions, we can perform a pixel-wise and feature-based analysis to highlight specific artistic inaccuracies. This approach makes corrections more intuitive and grounded in spatial reality, rather than relying solely on heuristics or artistic style comparisons.
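As a minimal sketch of the pixel-wise half of that comparison, assuming the two renders are available as grayscale arrays (toy data here, not real renders), a thresholded difference map isolates the regions that need correction:

```python
import numpy as np

def deviation_mask(user_render, ideal_render, threshold=0.1):
    """Pixel-wise absolute difference between two grayscale renders,
    thresholded into a binary mask of regions that deviate."""
    diff = np.abs(user_render.astype(float) - ideal_render.astype(float))
    return diff > threshold

# Toy 8x8 "renders": the same object appears slightly shifted
# in the user's version.
ideal = np.zeros((8, 8))
ideal[2:5, 2:5] = 1.0
user = np.zeros((8, 8))
user[3:6, 3:6] = 1.0

mask = deviation_mask(user, ideal)
# mask is True only where the two renders disagree: the non-overlapping
# fringe around the shifted object, i.e., where correction is needed.
```

A feature-based comparison (e.g., over edge maps or learned embeddings) would follow the same pattern, just with the raw pixels replaced by extracted features before differencing.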