Real-world photo editing involving non-trivial semantic changes has been a fascinating challenge in image processing, especially in recent years. In particular, the ability to control an image with just a short natural-language prompt would be a disruptive innovation in this field.

The current top methods for this task still have various drawbacks: first, they can usually only be applied to images from a specific domain or to artificially generated images. Second, they support only a limited set of edits, such as drawing on the image, adding an object, or transferring a style. Third, they require auxiliary inputs beyond the image itself, such as masks indicating the desired editing location.

A group of researchers from Google, the Technion, and the Weizmann Institute of Science have proposed Imagic, a semantic image-editing technique based on Imagen that addresses all of the aforementioned issues. Their approach can perform complex non-rigid edits on real high-resolution photos, requiring only a single input image and a single text prompt specifying the target edit. The resulting images align well with the target text while preserving the background, composition, and overall structure of the source image. Imagic supports a wide range of changes, including style adjustments, color changes, and object additions, as well as more complex edits. Some examples are shown in the figure below.



Given an input image and a target text prompt describing the edit to be applied, Imagic’s goal is to modify the image so that it satisfies the text while preserving as much detail of the original image as possible.

More specifically, the method involves three steps, also shown in the figure below:

  1. Text embedding optimization. The target text is passed through the text encoder to produce the target text embedding e_tgt. The generative diffusion model is then frozen, and the embedding is optimized for a number of steps, yielding e_opt. After this process, e_opt matches the input image as closely as possible.
  2. Diffusion model fine-tuning. Passing the optimized embedding e_opt through the generative diffusion process may not reconstruct the input image exactly. To close this gap, in the second stage the model parameters are fine-tuned while keeping the optimized embedding e_opt frozen.
  3. Linear interpolation between the optimized embedding e_opt and the target text embedding e_tgt, using the model fine-tuned in step 2, to find a point that achieves both image fidelity and target-text alignment. Since the fine-tuned model fully reconstructs the input image at e_opt, moving from e_opt toward e_tgt applies the desired edit. This third stage is, more precisely, a simple linear interpolation between the two embeddings.
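The three stages can be illustrated with a minimal, self-contained sketch. Here a fixed linear "decoder" stands in for the frozen diffusion model and a squared reconstruction loss replaces the diffusion objective; all names, dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_img = 8, 16

# Toy stand-in for the diffusion model: a linear "decoder" mapping a
# text embedding to an image, with a squared reconstruction loss.
W = rng.normal(size=(d_img, d_emb))
image = rng.normal(size=d_img)   # the input image to be edited
e_tgt = rng.normal(size=d_emb)   # embedding of the target text prompt

def recon_loss(emb, weights, img):
    r = weights @ emb - img
    return float(r @ r)

# Stage 1: optimize the embedding while the "model" W stays frozen,
# so that e_opt reconstructs the input image as closely as possible.
e_opt = e_tgt.copy()
for _ in range(1000):
    grad_e = 2 * W.T @ (W @ e_opt - image)   # gradient of ||W e - image||^2
    e_opt -= 5e-3 * grad_e

# Stage 2: fine-tune the "model" while the embedding e_opt stays frozen,
# closing the remaining reconstruction gap.
W_ft = W.copy()
for _ in range(500):
    grad_W = 2 * np.outer(W_ft @ e_opt - image, e_opt)
    W_ft -= 2e-3 * grad_W

# Stage 3: linearly interpolate between e_opt and e_tgt and decode with
# the fine-tuned model; eta trades fidelity (eta -> 0) for edit strength.
eta = 0.7
e_interp = eta * e_tgt + (1 - eta) * e_opt
edited = W_ft @ e_interp
```

With eta = 0 the fine-tuned model reproduces the input image; increasing eta moves the output toward the target text, mirroring how Imagic trades reconstruction fidelity for edit strength.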


The authors compared Imagic with state-of-the-art models, showing the clear superiority of their approach.


Furthermore, the model's ability to produce diverse outputs from the same input image and text prompt by varying the random seed is shown below.


Imagic still has some drawbacks: in some cases, the desired edit is applied only subtly; in others it is applied effectively but alters other details of the image. Regardless, this is the first time a diffusion model has been able to edit images from a text prompt with such precision, and we can’t wait to see what comes next.

This Article is written as a research summary article by Marktechpost Staff based on the research paper 'Imagic: Text-Based Real Image Editing with Diffusion Models'. All Credit For This Research Goes To Researchers on This Project. Check out the paper.

Leonardo Tanzi is currently a Ph.D. Student at the Polytechnic University of Turin, Italy. His current research focuses on human-machine methodologies for intelligent support during complex medical interventions using deep learning and augmented reality for 3D assistance.

