Diffusion in the Wild: Pose-Controlled Animal Generation with ControlNet
Penggao Gu (pgu3), Ruoheng Yuan (ryuan19), Yichen Li (yli937), Muchen Zhong (mzhong18)
Introduction Our team is interested in AI-generated art and image manipulation. For this project, we reimplement ControlNet, a landmark method for controllable image generation and style transfer. Standard Stable Diffusion can be erratic and prompt-sensitive, and it offers little spatial control. ControlNet adds explicit conditioning to Stable Diffusion while freezing the base SD weights, which preserves the original generation quality. This addresses the main shortcomings of standard Stable Diffusion by letting users fix pose, layout, edges, depth, or structure. In this project, we train a ControlNet model that produces pose-controlled animal images.
Related Work A closely related study is the Pose Guided Person Generation Network (PG2), an earlier framework for pose-conditioned image synthesis. PG2 takes a two-stage approach: it first generates a coarse image that matches the target pose and then sharpens it with a second refinement network (Ma et al., 2017). Although PG2 uses GANs rather than diffusion models, it faces the same issues, such as keeping poses accurate and handling background changes, which makes it highly relevant to our project. To keep our architecture consistent with the original method, we also refer to two publicly released pose-conditioned ControlNet models on Hugging Face: the sd-controlnet-openpose implementation by lllyasviel and the SDXL-based controlnet-openpose-sdxl-1.0 model by xinsir (lllyasviel/sd-controlnet-openpose · Hugging Face, n.d.; xinsir/controlnet-openpose-sdxl-1.0 · Hugging Face, n.d.).
Data For this project, we use the AP10K-poses-controlnet-dataset, a publicly available dataset built specifically for training ControlNet-style pose-conditioned diffusion models (JFoz/AP10K-poses-controlnet-dataset · Datasets at Hugging Face, n.d.). The dataset contains roughly 7,020 training samples, and each sample includes four aligned components: (1) a 512×512 RGB image of an animal; (2) a conditioning pose image rendered as an OpenPose-style skeleton; (3) an overlay image combining the original image and its pose map; and (4) a short caption describing the animal. Everything is stored in Parquet format. We chose this dataset because it mirrors the structure used in the ControlNet paper: image, pose map, and caption. Little preprocessing is needed, because the dataset already provides the pose maps at 512×512 resolution, which is the expected input size for Stable Diffusion and ControlNet. Our preprocessing therefore consists of normalizing pixel values, encoding the images into latent space with the Stable Diffusion VAE, tokenizing the captions for the CLIP text encoder, and batching the results. Because the dataset is relatively clean and consistent, no heavy filtering or cleaning is required.
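The preprocessing steps above can be sketched in a few lines. This is a minimal illustration and not our actual pipeline: it assumes pixels are scaled from [0, 255] to [-1, 1], that the Stable Diffusion 1.5 VAE downsamples by a factor of 8 and produces 4 latent channels, and that tensors use a channels-last layout. The function names are hypothetical.

```python
def normalize_pixel(value: int) -> float:
    """Map an 8-bit pixel value in [0, 255] to the range [-1.0, 1.0]."""
    if not 0 <= value <= 255:
        raise ValueError("pixel value must be in [0, 255]")
    return value / 127.5 - 1.0


def latent_shape(height: int, width: int,
                 downsample: int = 8, channels: int = 4) -> tuple:
    """Return the (height, width, channels) of the VAE latent for an image.

    Assumes the VAE reduces each spatial dimension by `downsample`
    (8 for Stable Diffusion 1.5) and outputs `channels` latent channels.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (height // downsample, width // downsample, channels)


if __name__ == "__main__":
    print(normalize_pixel(0), normalize_pixel(255))
    print(latent_shape(512, 512))
```

In practice the normalization would be applied to whole image arrays rather than single values. Under these assumptions, a 512×512 image maps to a 64×64×4 latent.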
Methodology We reimplement the core idea of ControlNet for pose-controlled animal image generation. Our model keeps Stable Diffusion 1.5 frozen and adds a small trainable branch that encodes animal-pose images. The branch converts the pose map into features and injects them into selected U-Net layers so the backbone can follow the input structure while keeping its original visual quality. Only the new control layers are updated during training, and the backbone remains fixed. The main challenges are limited compute and the need to build the control path from scratch. Small batch sizes make the training process more sensitive and harder to stabilize. Re-creating the conditioning path also requires careful handling of feature sizes and injection points across different U-Net resolutions. Any mismatch can break training or cause the model to ignore the pose. These technical details are the most difficult parts of the project.
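To make the feature-size bookkeeping concrete, the sketch below lists the channel count and spatial size the control branch must match at each injection point. It is a hedged illustration, not our implementation: it assumes the commonly documented Stable Diffusion 1.5 encoder layout (base width 320, channel multipliers 1, 2, 4, 4, two residual blocks per level, a 64×64 latent) and the ControlNet convention of one control output per encoder block plus one for the middle block. The function name is hypothetical.

```python
def control_injection_shapes(latent_size: int = 64,
                             base_channels: int = 320,
                             channel_mult: tuple = (1, 2, 4, 4),
                             res_blocks: int = 2):
    """Return ([(channels, size), ...] per encoder block, middle-block shape).

    All defaults are assumptions about the SD 1.5 U-Net encoder.
    """
    # The input convolution produces the first skip feature.
    shapes = [(base_channels, latent_size)]
    size = latent_size
    for level, mult in enumerate(channel_mult):
        channels = base_channels * mult
        # Each residual block at this level emits one skip feature.
        for _ in range(res_blocks):
            shapes.append((channels, size))
        # Every level except the last ends with a downsampling layer,
        # which halves the spatial size and keeps the channel count.
        if level < len(channel_mult) - 1:
            size //= 2
            shapes.append((channels, size))
    middle = (base_channels * channel_mult[-1], size)
    return shapes, middle


if __name__ == "__main__":
    shapes, middle = control_injection_shapes()
    for index, (channels, size) in enumerate(shapes):
        print(f"encoder block {index:2d}: {channels} channels at {size}x{size}")
    print(f"middle block: {middle[0]} channels at {middle[1]}x{middle[1]}")
```

Under these assumptions there are 12 encoder injection points plus the middle block. Each control feature must match the corresponding shape exactly before it is added to the backbone, and checking this list against the real model is a cheap way to catch mismatches before training starts.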
Metrics We train the model on the animal-pose dataset with Stable Diffusion 1.5 as the base model. Explicit control layers are added to steer the Stable Diffusion generation process. Only these extra control layers are trained, which gives the model explicit structural control while the base SD weights stay fixed. We apply the following metrics:
- MSE loss for monitoring training and validation error.
- FID score for evaluating generative quality.
- CLIP score for measuring text–image alignment.
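The MSE loss listed above is not spelled out in this writeup. One plausible form, assuming the standard noise-prediction objective used for latent diffusion models and described in the ControlNet paper, is given below. The symbols are illustrative notation: z_t is the noised VAE latent at timestep t, c_t is the CLIP text embedding, c_p is the pose-map condition, and ε_θ is the U-Net with the control branch attached.

```latex
\mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_p,\, \epsilon \sim \mathcal{N}(0, I)}
\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_p) \right\rVert_2^2 \right]
```

Under this assumption, only the parameters of the control branch receive gradients from the loss, since the backbone weights are frozen.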
Ethics Our dataset consists of animal images, pose representations, and brief text captions. Although this dataset does not contain sensitive human information, it still carries several notable biases. The distribution of species, poses, and viewpoints is uneven, and the pose-estimation process tends to fail more often on small or uncommon animals. These biases may cause the trained model to favor well-represented species and typical poses while producing less reliable results for underrepresented cases, potentially limiting the generality of the system. The evaluation metrics discussed previously highlight image quality and text–image alignment, but they cannot fully capture structural fidelity to the input pose. From an ethical perspective, this gap creates a risk of overestimating model reliability when quantitative scores appear strong despite underlying pose inconsistencies. Such discrepancies underscore the need for qualitative inspection and careful interpretation of results. They also illustrate a broader challenge in controllable generation systems: numerical metrics may not reflect the specific failure modes most relevant to the intended use of the model.
Division of Labor Data Preprocessing: Penggao Gu Managed and preprocessed the animal pose dataset, including original images, pose instructions, and prompt hints. Built the dataloader for training.
Training: Ruoheng Yuan, Yichen Li Converted the PyTorch codebase to TensorFlow. Transferred Stable Diffusion 1.5 PyTorch weights into TensorFlow and initialized the TensorFlow models accordingly. Designed the TensorFlow DDIM model for ControlNet training and implemented the training and validation steps.
Inference: Muchen Zhong Implemented the DDPM sampler for inference and image generation. Monitored ControlNet behavior throughout the training process.
References Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., & Van Gool, L. (2017). Pose guided person image generation. Advances in Neural Information Processing Systems, 30.
lllyasviel/sd-controlnet-openpose · Hugging Face. (n.d.). https://huggingface.co/lllyasviel/sd-controlnet-openpose
xinsir/controlnet-openpose-sdxl-1.0 · Hugging Face. (n.d.). https://huggingface.co/xinsir/controlnet-openpose-sdxl-1.0#replace-the-default-draw-pose-function-to-get-better-result
JFoz/AP10K-poses-controlnet-dataset · Datasets at Hugging Face. (n.d.). https://huggingface.co/datasets/JFoz/AP10K-poses-controlnet-dataset#dataset-card-for-ap10k-poses-controlnet-dataset