Caution
24 Aug 2024 - Please keep in mind that this document is obsolete and lots of things have changed since its release; it is no longer maintained in terms of technicalities. Most things still apply as I still do things with the same workflow, except the settings at this point are a "whatever" and I use Hypertile with the recent A1111 releases (switched to 1.10), so the samplers/CFG etc. settings don't really matter unless you use things like XL Turbo or whatever people release today - use whatever is recommended. Also, "enhancers" in prompts are just a meme.
Table of Contents:
- Stable Diffusion, Subject-Oriented Inpainting
- Pre~ stuff and lots of warnings
- Inpainting Introduction
- Types of Inpainting
- Inpainting Settings
- Prerequisites
- Conclusion - Before Workflow
- Workflow explanation
- My workflow - intro to Subject-Oriented Inpainting
- Conclusion
- Personal note
This was supposed to be a quick guide + workflow documentation, but I changed my mind midway and decided to just explain what inpainting options and settings actually do, because it is important to understand how to use them, but it will mostly be just me rambling about myself being biased and complicating the actual workflow, even though people can do something close 10 times faster nowadays (or even in mere minutes).
But hey, it's not about me teaching you how to inpaint, but about showing you alternative solutions to some problems, which is why you should not follow most of what I do if you don't want to end up with a headache from trying to do things the way I work - almost all of the information should be taken with a grain of salt. That being said, my workflow and this document might not be for everyone.
As AI is progressing super fast, some of the information may be incorrect and outdated, because I'm working with AUTOMATIC1111 version 1.3.0 - they break stuff with each update, so ¯\_(ツ)_/¯. The current version is 1.5.1 at the time of writing this, so some settings may be different.
Important! This document is not beginner friendly. I assume you already have some basic AUTOMATIC1111 knowledge, you already have it installed, and you know how it works and how to make basic artworks with ControlNet. I will also refer to A1111 as Stable Diffusion (or SD), even though they are two separate things, I know.
Warning: this document will be super long and messy (around 80-90 minutes read), but unfortunately it is required to read it in its entirety, as this process is a bit more complex than traditional image generation and the sections contain information somewhat related to each other. It might contain some typos, some weird stuff and I might repeat myself a bunch (a lot actually), no time to double-check all of that. This is not some research paper, just a regular blog-post-style document.
If you came here for Anime style or NSFW, you can click off, I don't do this type of content and it might not work here. Although, you might still apply some of the information, but I'm not gonna test this on anime models.
Do not take this as an ultimate, absolute and definitive Inpainting guide, because there simply is none and there will never be. Your Stable Diffusion will behave differently. Most of the information should be treated as a suggestion, as working with Stable Diffusion is just constant experimentation through frustration.
This document contains some controversial takes on Samplers and Upscalers, and basically on everything else too, so I might be a little biased. Especially don't listen to my advice on prompting, because I obviously can't test things on all models.
Warning: this is not a simple "mask-and-generate" process. Highly detailed images can take you up to 2 days of intensive work, depending on the resolution, complexity, your understanding of how to guide SD into inpainting and your machine capabilities. Ideally your PC should have at least a GTX 1070, which has 8GB VRAM and is somewhat okay in terms of speed, but still, a better machine cuts the work time by a lot. Generating the target 1024x1024px or 1280x1280px with ControlNet can take up to 90 seconds on weaker machines, which can be annoying and make this workflow less rewarding due to resolution limitations. My workflow is not slow-machine-friendly. Just saying.
Also warning: this document is not about fixing hands / feet. For that, you have to research somewhere else :)
And also warning: this document provides information about super-detailing generic 1080p wallpapers (in most cases you can go higher anyway, depending on the complexity) and fixing artifacts on existing images. That's because I'm creating medium-sized wallpapers and assets for game development (2D), where raw scale matters.
Side note: I know that actual Game Artworks, mostly Splash Arts, are 2x4K (twice the 3840 x 2160 res, either V or H orientation), but that's only suitable for printing. I don't intend on creating printed artworks, so I don't need to go above 1080p as there is no benefit in creating huge images over a longer time frame.
As another side note: game textures are an exception, because 4096x4096 textures can still be inpainted no problem - unwrapped textures occupy small areas, which can be enhanced through this workflow. Actual Artworks are a different story. It is still useful for huge images, where you only need to fix small parts.
I don't care about the Ultra HD 8k TV-size images. If you want to create such absurdly huge images, this document will be useless to you, although, some parts of my workflow can still be applicable to fix irregularities. Other than that, a simple upscale will be enough as most people can't even tell the difference between regular AI-generated 1080p and downscaled UHD images - and that's basically it. But, if you want to push the details to the limit, I highly recommend experimenting with inpainting.
If you are more focused on fixing faces, just use After Detailer
extension in A1111.
Main purpose of this workflow:
- regenerating ugly over-sharpened/blurry artifacts
- introducing properly shaped details
- increasing the level of detail
- making it really hard to tell the image was AI-generated
- having complete control over the entire image (to some extent)
The problem is, people are lazy and they want to see ultra quality immediately with one click solutions. Everyone thinks you can automate everything, because it's AI and it should handle it for them. Well, I got bad news for you :)
Less effort = bad results. More effort = greater reward. Stop being lazy, you people.
Before actually getting into Inpainting, there is A LOT of information to process if you want to use it efficiently, because it's not as simple as just masking things and re-generating them.
Inpainting is an alternative form of img2img, allowing you to change certain parts of the image by applying a mask.
This is a very powerful tool that not only helps you with fixing just a specific part of the image, but also drastically increases the level of detail, which is simply impossible to achieve with any kind of currently available extensions or "prompt engineering". Yes, even with all the next generations of Midjourney - but that comes with the cost of a huge time investment.
Armed with lots of patience, practice and experience, this method allows you to create professional artworks yourself as a non-artist, meaning your images will be highly unlikely to be deemed AI-generated - that is, if you put enough effort into your image and you know what you are doing.
Unfortunately, you are pretty much limited in this workflow. The level of detail highly depends on the resolution your GPU can handle. If you can generate 2048x2048 or just 4 megapixels at maximum (should be around 16GB VRAM with ControlNet), you will be able to create at least 4K UHD artworks (3840 x 2160px or 4096 x 2160px).
You shouldn't be aiming higher anyway. Let's be honest, you don't really need images above this resolution, 4k is definitely enough (do not confuse with 4096x4096, it's not the same). You can of course scale this infinitely with outpainting, but that's a bit more complex on higher resolutions. There's way more to explain here, but it's out-of-scope of this document.
Now, why did I mention that it's best if your GPU can handle at most 2048 x 2048px resolution? This will be explained in more detail in the actual workflow, but the general concept is that you will be working in the Only masked mode of the Inpainting, which upscales the masked content. This is important, because if you really want to create huge images, you will need a stronger machine to do this efficiently, so you are limited by the speed and VRAM of your GPU.
The main purpose of this document is to guide you through the basics of high resolution inpainting, revealing the secrets of super-detailing, rather than giving you ultra quality images straight away. That's not how it works. This is simply not possible (yet) and to prove it, open any AI-generated image at its raw scale.
Everything comes with lots of practice based on your time investment, working through the pain of AI being extremely uncooperative, learning to solve problems, exploring the details no one talks about.
There are different forms of Inpainting, which might be a little confusing and they all result in developing a different workflow, depending on your needs.
You can add stuff, remove, paint over image, regenerate it, change something completely, etc.
I will quickly go over the different tabs and the settings, as there is something important to explain...
The Sketch tab is only useful if you need to paint over an image in order to "remove (*)" something from the image or quickly correct some lines before regenerating the image in the img2img tab or before sending it back to the Inpaint tab.
Removing, meaning you have something like clothing with unwanted patterns: you sketch over them with the colors of the clothing and then regenerate to correct the image.
As a side note, removing objects is more tricky, because you either have to be accurate with your sketching over the unwanted objects or play with token weights, latent noise or Inpainting version models. Here I chose an Inpainting model, because this one worked just fine; it was not required though. Everything depends on what's already in the image, so the generation accuracy might be low!
Sketching your own objects onto the canvas:
Think of this tab as of integrated MS Paint into your webui :) basically the same thing. You can generate straight from this tab, but it will regenerate the entire image. You can use the buttons below your input canvas to send to Inpaint or other available tabs:
This is the main tab I'm using in my workflow, so this is where all the magic happens :D
The main Inpaint tab allows you to select (mask) a part of the image and let SD regenerate it.
Example:
That's basically it.
Now this tab is a bit more confusing, because it has an additional setting: mask transparency, which requires some experimenting in order to make it work (it will work differently on every image). I have never seen an actual use for this, so I will ignore this option. It can be useful for special effects, though.
Inpaint Sketch allows you to both Sketch and Mask at the same time.
To understand this tab better, think of it like this: you mask a part of the image, but the color of your mask is actually important here. Stable Diffusion will try to generate whatever is under the mask, which is also just a paint-over. The more accurate your sketch is, the better. Works best if you are able to draw highlights and shadows correctly and with somewhat visible detail. The effect is even greater if you combine it with ControlNet, but I won't be doing this here, as it's a different story.
Just remember to pay attention to the tabs. If you accidentally use Inpaint Sketch instead, you will have a bad time :)
Stable Diffusion will try to generate a black hole on your image, so that's an important thing to keep an eye on.
Here, I will try to quickly sketch something inside the Inpaint Sketch tab with grey color and some orange nonsense:
All of the following will now relate to the Inpaint tab, as I never use the other tabs.
Stay on Just Resize - you will almost never use the rest of the settings, unless you just want to do some cropping.
Note on Latent Upscale - this does absolutely nothing, it's the same as Just Resize. I haven't found any information on this and after some experiments, all my images were absolutely identical, so there's no reason to use this at all.
In pixels, this describes how much your drawn mask will be blurred before processing.
Most of the time you won't be touching this option, but... it's still important to note that the default 4px value will be too high for some images.
Consider the following image:
Let's say you are trying to change that tree into something else, like... a steel pipe, I don't know. In this sample, the tree is 6 to 8 px wide. Masking this tree with default settings will result in basically nothing, because the edges of the mask will get blurred so much that Stable Diffusion will have no mask at all. This mostly applies when masking thin wires, fingers or thin/tall objects in general.
Now, you might think that high blur would result in a very well-blended image. Well, wrong. You want to keep it at 4, 6, sometimes 8px (mostly just 4px), because high blur will very often result in very bad quality on the mask edges.
For science, 64px mask blur:
Notice how awfully visible the blur is. Obviously, this is a bad example as this image is small, but trust me, the effect is just as bad on big images, especially on complex backgrounds. I never switch from 4px anyway, and I'm completely fine. A mask blur of around 10-20px can make sense when inpainting faces, but there are better solutions for that now, like face swaps.
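If you want to see the thin-mask problem for yourself without burning generations, here is a minimal sketch using Pillow (not A1111's actual code, just the same idea of blurring the mask before processing):

```python
# Rough illustration (Pillow) of why high mask blur kills thin masks.
# This is NOT A1111's exact code - just the same idea: the mask gets blurred
# before processing, so a thin mask can fade to almost nothing.
from PIL import Image, ImageDraw, ImageFilter

def max_mask_strength(mask_width_px: int, blur_px: int) -> float:
    """Draw a thin vertical white stripe on black, blur it, return the peak value (0-1)."""
    mask = Image.new("L", (256, 256), 0)
    draw = ImageDraw.Draw(mask)
    x0 = 128 - mask_width_px // 2
    draw.rectangle([x0, 0, x0 + mask_width_px - 1, 255], fill=255)
    blurred = mask.filter(ImageFilter.GaussianBlur(blur_px))
    return max(blurred.getdata()) / 255.0

for blur in (4, 8, 16, 64):
    # an ~8px wide "tree" mask, like in the example above
    print(blur, round(max_mask_strength(8, blur), 2))
# With a 4px blur a good chunk of the mask strength survives; at 64px the peak
# drops to a few percent, so Stable Diffusion is left with almost no mask at all.
```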
ALTHOUGH...
There is one use of mask blur, mostly when you work with very simple backgrounds, like sky, ocean, etc.
On this example image I inpainted the clear part of the sky, "overshooting" a lot on the darker clouds, so the blur can do its thing. While this works, the blur will still be visible, so for me this is useless, because the blur will look ugly in raw scale.
This can be good when prototyping - quickly inpainting small image for upscaling. Other than that, nah.
There are two modes available:
- masked
- not masked
By default, your Stable Diffusion will be in Masked mode. In this mode, Stable Diffusion will generate a new image on your masked area.
Not Masked mode is the opposite of Masked. This mode will generate a new image ignoring the masked area. This mode has its use when you want to change the entire image, except the masked part. For example, you can take a photo of a person, mask the face only and switch to Not Masked mode. This will make Stable Diffusion regenerate the entire image except the face.
Here's an example: masking the table and regenerating the surroundings - this can be inaccurate, though.
Ignore the incorrect lighting and other artifacts, it's just a simple example.
Masked Content tells Stable Diffusion how to generate the new image on the masked area. I will side-track a bit here as I didn't intend to explain these options, but it's good to know what they do. Sometimes you will have to use them anyway.
This can also be used with Not Masked mode and it works pretty well!
- fill - this option is pretty confusing; in my understanding, it fills the masked area with content that adapts to the mask edges according to the general average color composition (color values???) of the entire image (or context), while completely ignoring what is under the mask. This should regenerate the masked area with something new that "fits in". It is a bit different from Latent Nothing, because fill initially fills the mask with an average color - or rather the color you would see after blurring the whole context to the max - while Latent Nothing is always filled with a constant solid color value. You can approximately achieve a similar effect by selecting the area with the Lasso Tool, feathering the selection by whatever your mask blur is in A1111, inverting the selection (ctrl + shift + i), applying Average Blur through Filters -> Blur -> Average, copying the color (eyedropper), reverting the filter and the selection inversion, and using the Paint Bucket to fill the selected area with that color. There's more magic happening, because in early steps you can see the surroundings blurring into the mask. All of this will happen in reverse if you are doing Inpaint Not Masked.
- original - this is the only option where the mask contents are not filled with noise/solid color. The masked area will be initially ignored and the original input is used, then the denoise from the current seed will do its thing from there.
- latent noise - there's a deeper explanation of latent space, but in the context of this document it does not matter that it's some kind of compressed data of the currently selected model. Stable Diffusion will fill the masked area with noise generated from the current seed (or offset by the seed, not gonna dig through the code), which will then keep transforming the image according to the data the model was trained on + the prompt.
- latent nothing - this will fill the mask with some sort of solid color, which apparently is a blend of your source image colors, depending on how big the available context is. This is mostly visible on low denoise (below 0.6). Although the results are similar to latent noise on high denoise strength, this option creates more "soft" results. Both options are somewhat good for regenerating non-masked areas!
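For the curious, here is a rough approximation of the fill-style pre-fill described above, using Pillow. This is not Stable Diffusion's actual implementation - it just mimics the "average color + feathered edge" idea, mirroring the Photoshop steps; the file names are hypothetical:

```python
# A rough approximation of the "fill" pre-fill described above (NOT SD's actual code):
# fill the masked area with the average color of the whole context, feathered by the mask blur.
from PIL import Image, ImageFilter, ImageStat

def fill_with_average_color(image: Image.Image, mask: Image.Image, mask_blur: int = 4) -> Image.Image:
    """image: RGB source; mask: 'L' mode, white = masked area. Returns the pre-filled init image."""
    avg = tuple(int(c) for c in ImageStat.Stat(image).mean)       # average color of the entire image
    solid = Image.new("RGB", image.size, avg)
    feathered = mask.filter(ImageFilter.GaussianBlur(mask_blur))  # same idea as A1111's mask blur
    # paste the flat average color into the masked region, blended along the feathered edge
    return Image.composite(solid, image, feathered)

# usage sketch (hypothetical file names):
# base = Image.open("room.png").convert("RGB")
# mask = Image.open("mask.png").convert("L")
# init = fill_with_average_color(base, mask, mask_blur=4)
```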
Comparison of Latent Noise and Latent Nothing on the same image as above, except this will be generated in Non Masked mode to make a different background for better visualization. Think of it as: latent noise is more aggressive - it can over-sharpen and produce weird details/things.
Latent Noise:
Latent Nothing:
In general, you would be using these two options to generate something new on the masked area.
IMPORTANT:
Fill and Original can also be used in this situation to add objects anyway, except the Latent options tend to work better even when the entire context is the whole image (when adjusted correctly!!!), while Fill/Original start to work when there is less context, like when generating in the Only Masked inpaint area.
Fill / Original / L. Noise / L. Nothing - on Only Masked:
The above worked because Only Masked upscales the processed image to the target size, giving more space for SD to work. This changes when Whole Image is used and the mask is small.
It is also important to note that everything depends on the mask shape, blur and resolution. Obviously, when generating something like a bar stool on the grass, when there is almost nothing that resembles the prompted shape on the image, the resolution is too small or the mask is too blurred, the generated image will just be whatever fits the masked area.
That being said, if you don't adjust your settings, you will just encounter the following, not knowing what's going on (all methods, like on the above image):
The main differences:
- Fill and Original will produce similar results
- Latent Noise and Latent Nothing will produce similar results
- on almost 1.0 denoise, all of them may or may not be similar to each other
Just a random example of adding something onto the scene. I used 0.9 denoise and asked for a coffee machine:
All of the above was tested on Inpainting models, as it should be. Regular models are not specialized in coherent mask filling (and again, it depends).
To show how denoise strength affects the result, below are two plots of Masked and Non Masked modes.
Conclusion? You need to do more testing on your own models and settings, because everything comes down to the fine-tuning.
This is the main juice and it is super important to understand these two options!!!
This will tell Stable Diffusion how to treat your input image along with the masked area.
This option will take your entire image into the processing, but only the masked area is regenerated. This gives Stable Diffusion the context of your entire image, so it can fit something in a way that the result is more coherent, with more matching colors.
IMPORTANT: this option is only useful on small images, when you are preparing your base image for proper inpainting or upscaling, as it requires your entire picture to be sent into the processor. If you have a high resolution image, you will never use this option, because it will be generated at the same size (or bigger, if you set it that way).
To visualize the problem of context, I will use the same inpaint mask from above with the coffee maker.
Whole Picture vs Only Masked
In this example, I used Latent Noise as masked content to generate a coffee machine on the table; you can imagine what the mask size was - roughly the size of the coffee machine you see in the images.
Now, the huge difference here is that Stable Diffusion has the context of the entire image, meaning the generated image will not only fit the input image better, but (I think) SD also has way more understanding of the area surrounding the mask: an empty room illuminated by sunlight through the window to the left - you can tell by the highlights on the coffee machine.
Now, looking at the image from Only Masked mode, you can just tell it's out of place: badly lit, the shadows will be mostly incorrect (plot later), the colors are off/dull.
To make the comparison easier to visualize, here's a plot of both options.
Spoiler alert: the prompt can't help here, so some silly prompting for "sunlight from left" will not work. It does absolutely nothing :)
Whole Picture:
Only Masked:
Keep in mind this is not a real world example, normally you would experiment with the settings and mask shape!
As it was already somewhat explained above, I will just add, that you will use this option 99% of the time. As to why, I will explain in the workflow.
Speaking of the image context: when using Only Masked mode, you are not limited to the context Stable Diffusion calculates based on your mask shape. You can specify how much padding to insert into the processed image around the mask.
Below is a rough representation of what Only Masked Padding will look like - this will also result in changing the mask size while processing!
Each white square shows how much context Stable Diffusion is given - except the first inner one: that is the bounding box calculated from the mask shape, which will eventually be expanded outwards if padding is specified, as long as it's still within the input image boundary.
It is also important to understand how the mask's bounding box is calculated. Stable Diffusion will first create an image of the size where the outermost points of the mask form a rectangle (colored in red):
Now, what does that mean?
Let's assume we have set a size of 512x512px in our inpaint settings and our calculated mask is 312x283px. The first thing Stable Diffusion will do is resize the mask region to match the target size aspect ratio (1:1, because we have a square), and it will do that by automatically calculating the padding, which is a resize-to-fit (center of the processed image, resized to the first X or Y boundary):
Width: 312px
Height: 283px
// The image is wider than it is tall!
VerticalPadding = (Width - Height) / 2
VerticalPadding = (312 - 283) / 2 = 29 / 2 = 14 (remainder: 1px)
Since the difference is odd, we have a result of 14.5, which leaves 1px of extra padding. I assume the top takes precedence in this situation, so we will have a vertically padded mask: 15px from the top, 14px from the bottom:
The calculated padding is additional context, automatically inferred by Stable Diffusion!
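Here is a small sketch of that resize-to-fit calculation in Python. It reproduces the numbers from the example above; it only approximates the idea and is not copied from A1111's source:

```python
# A sketch of the resize-to-fit padding from the example above (312x283 mask, 512x512 target).
# It approximates the idea and is not copied from A1111's source.

def pad_mask_bbox(mask_w: int, mask_h: int, target_w: int, target_h: int):
    """Expand the mask bounding box so it matches the target aspect ratio."""
    target_ratio = target_w / target_h
    if mask_w / mask_h < target_ratio:
        # too narrow: pad horizontally
        new_w, new_h = round(mask_h * target_ratio), mask_h
        pad, axis = new_w - mask_w, "horizontal"
    else:
        # too wide (our 312x283 case): pad vertically
        new_w, new_h = mask_w, round(mask_w / target_ratio)
        pad, axis = new_h - mask_h, "vertical"
    # an odd remainder means one side gets the extra pixel (here: the first side)
    first, second = pad - pad // 2, pad // 2
    return new_w, new_h, axis, first, second

print(pad_mask_bbox(312, 283, 512, 512))
# -> (312, 312, 'vertical', 15, 14): a 312x312 region padded 15px on top and 14px on the bottom,
#    which is then upscaled to 512x512 for processing and scaled back afterwards.
```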
The mask shape does not have to be a single defined shape; you can mask disconnected shapes and it will work anyway.
This is how you do custom context - just place a dot anywhere on the canvas to extend the processed image. By doing this, you give Stable Diffusion more context about the surroundings, which helps generating content that actually makes sense, otherwise the generated image will just feel out of place. This is the most important part of the inpainting!
Now, having our settings in mind, the 312x312px image will be generated and upscaled to 512x512px, then downscaled back to 312x312px and placed back into the original position. The next section will explain why this is important.
This is the detail no one talks about and this is the main topic of this document: when in Only Masked mode, the masked area will be Upscaled to the target size (*).
Again, very important piece of information: the masked image is Upscaled.
When inpainting, Stable Diffusion will not just resize, it will upscale the masked image using the img2img upscaler model. With this in mind, remember that your selected upscaler model is important, but don't worry about this for now - this quicksetting will be explained in the workflow.
The only exception is Inpainting with the Tile Resample model. Here, the effects of different upscalers are limited and almost unnoticeable, although they can still affect the tone.
What is the actual relation of the mask to the target size? Since we know Stable Diffusion will Upscale the masked image, we can take advantage of Subject-Oriented Inpainting - which, in short, means masking all subjects one by one (or area by area) and regenerating them separately at a way higher resolution.
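To make the idea concrete, here is a hedged sketch of Subject-Oriented Inpainting done through A1111's API (webui started with --api). The payload fields shown (init_images, mask, inpaint_full_res, inpaint_full_res_padding, inpainting_fill, etc.) exist in the /sdapi/v1/img2img endpoint, but names and defaults can differ between versions, so treat this as an illustration of the loop, not a drop-in script - I do all of this by hand in the UI:

```python
# A sketch of Subject-Oriented Inpainting through A1111's API (webui launched with --api).
# The field names below exist in the /sdapi/v1/img2img payload, but they can change between
# versions - treat this as an illustration of the loop, not a drop-in script.
import base64, requests

URL = "http://127.0.0.1:7860"  # assumed local webui address

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def inpaint_subject(image_b64: str, mask_path: str, subject_prompt: str) -> str:
    payload = {
        "init_images": [image_b64],
        "mask": b64(mask_path),                  # white = area to regenerate
        "prompt": subject_prompt,                # only the subject currently being inpainted
        "negative_prompt": "monochrome, simple background",
        "denoising_strength": 0.9,
        "mask_blur": 4,
        "inpainting_fill": 1,                    # 0=fill, 1=original, 2=latent noise, 3=latent nothing
        "inpaint_full_res": True,                # "Only masked" - the key to the upscaling trick
        "inpaint_full_res_padding": 32,          # extra context around the mask, in pixels
        "width": 1280, "height": 1280,           # target size the masked region gets upscaled to
    }
    r = requests.post(f"{URL}/sdapi/v1/img2img", json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["images"][0]                 # base64 of the image with the patch merged back

# one mask per subject, each processed separately and fed back in as the next init image
# (hypothetical file names):
current = b64("scene.png")
for mask, prompt in [("tree_mask.png", "detailed old tree"), ("rock_mask.png", "mossy rock")]:
    current = inpaint_subject(current, mask, prompt)
with open("scene_detailed.png", "wb") as f:
    f.write(base64.b64decode(current))
```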
What does this actually do? I don't want to spend 2 days on one image!!!!!!!
I know what you are thinking. You could generate an image, upscale it 8 times and downscale back, right? Use Latent Couple, Segmenting or Ultimate SD Upscale to achieve the same effect? Well, you are wrong :)
Just to prove my point, here's a random image I generated in 1920x1088 resolution (there is no 1080), upscaled 8 times with a regular upscale through Extras, and the same one with Ultimate SD Upscale, then downscaled back (left/middle part).
I also took a small part of it on raw scale (initial image), then quickly inpainted different areas (min. 0.9 denoise + tile resample) + corrected the colors in Photoshop on the result image. This is just one pass over this image and it already looks 10 times better, but requires further fixing. Comparison between other methods:
- generated 1080p image -> regular upscale 8 times, downscaled back to match the original
- generated 1080p image -> Ultimate SD Upscale 4x (8 was an overkill, looked worse), downscaled back
- manually inpainted, no final upscaling
Model: Realistic Vision 3.0
On further inspection: my mistake, an incorrect model on the top-left of my image (switched in-between the tests), but I decided to leave it as is. That's why I mentioned one pass - I reiterate on images later to correct everything anyway.
Notice that upscaling images regularly will introduce different kinds of artifacts: ugly over-sharpening, blur, unwanted noise, squiggly lines, bad shapes (depends on the upscaler used; don't use 4x UltraSharp btw, it's somewhat good for inpainting - still barely anyway). Also, the test 8x upscale made very little difference compared to 4x, so I just left it there. The maximum factor for the Upscale -> Downscale process should be at most 3 (only during Inpainting).
Why does the bad quality happen? You are upscaling your images with a general-purpose upscaler, which is a lazy solution for poor quality images. Inpainting smaller areas solves this problem by properly upscaling masked parts of the image, which can be fine-tuned through specialized upscaler models, then inpainted again with tile resample (I don't do this part, because I'm working on strict compositions and I don't need the specialized upscalers at this phase).
When inpainting, small objects should be upscaled at most 4 times, big objects 2-3 times, because objects upscaled more than 4 times will lose a huge amount of detail, resulting in blurry images, which defeats the whole purpose of this process. That means, if your masked object is around 256x256px, you should only upscale it to 1024x1024px, no more. You will only be shooting yourself in the foot most of the time, depending on the denoise. Increasing the resolution should only be used together with bigger padding on small objects, to increase the coherency of the result images (surrounding context). In this example, it can be good to upscale around 4 times, adding bigger padding, like 64px, then increasing the resolution to 1280x1280px. It's ok to overshoot by a couple %. More context = more sense, but less work space.
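A tiny sanity check for that rule of thumb: compute the effective upscale factor of an inpaint so you can see whether you are inside the ~2-4x sweet spot. The padding and target numbers below are just the example values from the text:

```python
# A tiny sanity check for the rule of thumb above: how hard is the upscaler actually being pushed?
# Purely illustrative - the padding/target numbers are the example values from the text.

def effective_upscale(mask_px: int, padding: int, target_px: int) -> float:
    """How much the processed region (mask + padding on both sides) is upscaled."""
    region = mask_px + 2 * padding
    return target_px / region

print(round(effective_upscale(256, 0, 1024), 2))   # 4.0  - the plain "4x, no more" case
print(round(effective_upscale(256, 64, 1280), 2))  # 3.33 - 64px padding + 1280px target from the example
# Anything much above 4 means the upscaler is being pushed too hard and the result will turn blurry.
```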
Now, you still might be asking: what's the point of all this?
Warning: controversial takes below, read at your own risk.
The reason why I'm pushing the inpaint-everything-separately workflow is that I really hate how AI images look both in raw scale and zoomed out, and no one uses absurd 10000-pixel-wide images anyway - it's the raw scale that matters, mostly because the assets I create are displayed at that scale: game assets, loading screens (depends on the rendering method; downsampled assets will look bad, which is why I prefer the images to look good in raw scale). If the generated images look good enough to you, use whatever upscaling method you prefer. Just upscale the image 8 times, downscale it back to the desired resolution. This kind of quality is unacceptable to me, though. You think you could trick someone by absurdly upscaling and then downscaling. 😉
This is simply because some people actually do Hires. fix + Tile Diffusion -> upscale 8x (small image) -> downscale 4x. This still does not look good enough to me, because I can immediately spot unnatural shapes, downscale blur, over-sharpened downscaled noise, generic noise and very specific noise patterns. Those images will only look good as thumbnails, because most of the information is being lost and only a small portion of the AI artifacts will remain - but hey, this is just my preference. Such images are useless in production, because to me they immediately scream "bad quality".
I could upscale my final images to ~10k+px and then downscale to match 2x the original resolution, but they just become over-sharpened, which throws the entire work into the trash.
So, now you will probably ask: but wait a moment, you don't look at 10 Mpx pictures at raw scale, that's the purpose of true high resolution. What are you even on about????.
True. If you paid attention: my ultimate goal is to make relatively crisp and, I emphasize, good looking images in AT LEAST 1080p resolution with not a single trace of AI-generated artifacts. No stupid noise, no blur mixed with over-sharpened areas, no unnatural shapes (unless it's something very specific). I don't care if you take an ~8000px image and show it to me at 512x512 resolution - this means absolutely nothing to me.
Although... you can still apply portions of my workflow on top of a massive image as long as the artifacts SD created are relatively small, e.g. you have a 10240x10240px upscaled image of a scene and there are some really disfigured objects in the background, which take enough space to still let SD upscale the masked part to re-introduce the details/fix the shapes.
You could still argue that you could upscale an image above 10k pixels and just downscale it with Bicubic Smooth to get rid of the over-sharpening. This is actually wrong and I can almost guarantee you that it just does not work on complex shapes and details. It only works on people, mostly portraits, but that's not what this workflow is for. There's like... 5% of content where this could look really good, like super minimalistic backgrounds or the mentioned portraits and straight up NSFW, but these are obviously incomparable - or just anime, but there enhancement is completely unnecessary, a regular upscale with R-ESRGAN anime is completely fine.
Latent Couple inpainting is still possible (or img2img rather): you can mask multiple areas and generate them at once, but it's too annoying and you lose working area and context the more subjects you mask, and you can't apply a separate upscaler for each masked part. If you don't really care about the limited upscale value or quality loss, you can just play with Latent Couple instead to cut the work time by a good amount, but it's useless on big images unless you have smaller parts that fall into a similar category your upscaler can handle. It's only useful on small images with limited subject types, so you can't unlock the full potential of detailed inpainting. It's good for the prototyping phase, useless in the production phase (unless you are fine with mediocre quality, that is).
Example of small scale Latent Couple enhancement:
While this might work on smaller scale, you are still limited by single upscale value for all subjects and single upscaler model, which can result in unpredictable images per mask and low quality. The generated image was made with ControlNet's Tile Resample, which is required in this workflow to retain the same composition. The goal is to not make an okay image, it's about pushing the level of detail to the limit (with enough time investment of course).
If you are still doubting, here are two comparisons between Tiled Diffusion and my workflow:
- https://imgsli.com/MTkwMTcw
- https://imgsli.com/MTkwMTY5 - testing on a seamless texture
Disclaimer, I am not aiming for realism, so don't expect ultra HD impossible super-realistic results. My style is Stylized-Fantasy, but with realistic models this can still work if you put more effort and experimentation into it.
There is quite a bunch of stuff you need in order to utilize this workflow, especially if you want to use it more efficiently.
They help in some way, but please keep in mind that this is not a solution for "instant perfect images"; that's not how it works.
There's also Bad Hands embedding (bad-hands-5 or something like that), but it doesn't do much anyway. I only use Deep Negative, it's more than enough.
You can now also browse Civitai and just check the negative embeddings tag or Textual Inversions for more negatives. I'm using pretty outdated ones, but they still do a decent job.
I use one and the same negative prompt in all my artworks, because beyond that it doesn't change much anyway.
Switching the token position in your prompt will result in a different composition; quality won't change, but that depends...
monochrome, simple background, NG_DeepNegative
Yes. That's it, you don't need more.
NOTE: I change names of all my embeddings and LoRAs, just a personal preference, keep this in mind when copying my prompt!
Here's a shocking discovery: if you ever stumble upon someone using a 200+ token long negative prompt with the DDIM sampler(!!!) - or if you are also using DDIM - well, all I have to say is: those tokens won't make your image better. All tokens above 75 (after the first chunk) won't even work, and you are perfectly fine with using just the NG_DeepNegative embedding with some highly scored tags (on e621), like monochrome, simple background (those two tokens are absolutely enough). If you want to take advantage of the accumulative quality increase from dozens of tokens + negative embeddings, you have to switch to another sampler of your preference. I use DDIM, because it already looks good with a small amount of tokens. Also...
Keep in mind that all those weird negative tokens you see all over the place don't have a big impact on your generation and might even do nothing at all. Things like username, watermark, logo - they mean nothing, as there is no concept of such things in the models unless you have a specialized negative embedding trained on texts/logos. There is a placebo effect to adding such tokens, or things like "duplicate", giving you a feeling that it works, but the reason it works is that changing tokens can change the composition. It won't magically create an identical(!!) image with the username token in effect, removing usernames/signatures/etc. from the image. If it does, it just means the composition has changed and the text doesn't fit the image - which happens mostly when darker parts of the image with white text have been changed to bright.
Then again, it depends how the model was trained.
Now, if you are more curious about the DDIM Sampler and 75+ tokens, check this out. I will compare a 236-token-long negative prompt vs 75 tokens:
// 75 tokens - same prompt, just stripped down to the max length of the first chunk
NG_DeepNegative,3D render,aberrations,abstract,
/// 236 tokens
NG_DeepNegative,3D render,aberrations,abstract,anime,black and white,cartoon,collapsed,conjoined,creative,drawing,
extra windows,harsh lighting,illustration,jpeg artifacts,low saturation,monochrome,multiple levels,overexposed,
painting,photoshop,rotten,sketches,surreal,twisted,UI,underexposed,unnatural,unreal engine,unrealistic,video game,
blurry,boring,close-up,dark,details are low,distorted details,eerie,foggy,gloomy,grains,grainy,grayscale,homogenous,
low contrast,low quality,lowres,macro,monochrome,multiple angles,multiple views,opaque,overexposed,oversaturated,
plain,plain background,portrait,simple background,standard,surreal,unattractive,uncreative,underexposed
Now here's the comparison: https://imgsli.com/MTkyNDY3
The differences are only caused by --xformers - it's non-deterministic.
To actually make use of your negative prompt, pick tokens that fit into the first chunk (those that actually make sense, excluding useless tokens like username, etc.). Either combine NG_DeepNegative with a couple of tokens or make a styles preset with the tokens you really need. If your negative embeddings fall out of the first chunk, they will do nothing, but it depends where the starting index of the negative embedding is in your negative prompt - at index 74 it still works, but the effect is very minimal. I won't dig deeper into why the DDIM sampler does that, but there's some sort of token
Then again, this applies to the DDIM sampler; everything works as expected on other samplers.
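If you want to check whether your negative prompt actually fits into the first 75-token chunk, you can count tokens with the standard CLIP tokenizer (the SD 1.x text encoder's tokenizer from Hugging Face). A1111 does its own chunking and handles embeddings specially, so the count is an approximation, not an exact match:

```python
# Count prompt tokens with the standard CLIP tokenizer (SD 1.x text encoder).
# A1111 does its own chunking and expands embeddings like NG_DeepNegative into several vectors,
# so treat this as an approximation, good enough to see whether you fit the first 75-token chunk.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def prompt_token_count(prompt: str) -> int:
    # drop the BOS/EOS tokens the tokenizer adds - we only care about the prompt tokens themselves
    return len(tokenizer(prompt).input_ids) - 2

print(prompt_token_count("monochrome, simple background"))  # a handful of tokens, easily inside the chunk
```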
I wanted to avoid this section, because every model can have different prompting - people train their models differently. The biggest difference is on anime models, they are mostly trained on booru tags - keywords-based, unlike regular models, which can work with sentence-like prompts.
Models can have trigger words. For example aZovya, which I'm using, has a zrpgstyle trigger word, which has to be present early in the prompt or with a higher weight. It doesn't mean the model won't work without it; you just put more emphasis on the dataset it was trained on (at least that's what people claim).
I use the following prompt with my selected model:
This is the Inpainting prompt structure, which is absolutely enough
zrpgstyle, detailed background, hi res, absurd res, ambient occlusion, (stylized:1.1), masterpiece, best quality,
<my_subjects_here>
,<lora:AddDetail:1>
No, it didn't really matter where the subject was located. Also, I don't use Artists' names or Artists-related LoRAs/Embeddings in any of my artworks.
- detailed background, hi res, absurd res - these were just the highest scored tokens related to quality, although I'm pretty sure absurd res is just one of those unfortunate placebo tokens, which do nothing but change the composition and actually have nothing to do with high resolution
- ambient occlusion is most likely a useless token, along with masterpiece and high quality, but I'm leaving them there just in case they work with some prompts or other tested models. Doesn't hurt to leave them there, they are in the styles file anyway.
I split my prompt into 3 sections:
- trigger + enhancers
- subject (thing I'm inpainting at the moment)
- LoRAs
The structure will change from model to model or over time, so you might not get the desired results with your models
A bonus plot of the absurdres token:
I only use one LoRA when detailing, which definitely helps to some degree, but it might require playing around with its weight. I'm using the Detail Tweaker. There are different versions, but this one works just fine.
I can also recommend other LoRAs, which I sometimes use for unique details and tone manipulation:
There are hidden capabilities of Only Masked Inpainting - since we know Stable Diffusion upscales the masked area, we will take advantage of this fact.
--- WARNING!!! ---
This functionality depends on the most crucial part - if you are working with ControlNet's Tile Resample, it won't work at all!
Upscaler models do very little in combination with Tile Resample during Inpainting, because when the source image is upscaled with Tile Resample ON, upscaler models cannot resample the image correctly based on how they were trained. This also depends on the ControlNet mode, where it has a somewhat lower effect when in the prompt importance mode, allowing the Upscaler to do its thing (partially only!). This does not apply to Inpainting with CNet Inpaint, so there upscalers will affect the generation.
Still, there are cases where Tile Resample can work with upscalers, but it only depends on your source image. Think of it as fine-tuning.
For example, this is the effect I'm looking for (stronger exposure of soft shadows with lower sharpness, check the bottom of the inpainted rock): https://imgsli.com/MTkyNjM1
Below are the comparisons made on a regular checkpoint (non-inpainting version). This is one of the examples where upscaler models have very little effect. This mostly happens when the source has enough detail already and the upscale value is somewhere below 2. It still depends on the source image, so it doesn't mean it will always work like this. This is just an example, and again, it's still better to get into the habit of changing to a proper, specialized upscaler just in case, so you don't have to waste time on checking which one works better on the currently inpainted area.
From my experience, in most cases and when inpainting something very specific with CNet's prompt importance mode, like Ground or a Plant from a smaller scale, it's just better to use specialized upscalers and only change to something else if you are not satisfied with the results.
This is also the main purpose of upscaling: you take a small image of bad quality and upscale it with a specialized upscaler, not a general-purpose one.
There obviously are places where this doesn't really matter, like inpainting distant objects without Tile Resample to enhance the image and let the upscaler do the work (https://imgsli.com/MTkyNjU1), so this might mostly just be a matter of preference. I'm not saying this technique is the best and that you should do it this way.
Comparison of 4x-UltraSharp vs ScreenBoosterV2 with CNet Tile Resample:
- Balanced mode: https://imgsli.com/MTkwNzc3
- Prompt importance mode: https://imgsli.com/MTkwNzcx
- CNet importance mode: https://imgsli.com/MTkwNzc0
Control sample with CNet OFF: https://imgsli.com/MTkwNzc4
Comparison of 4x-UltraSharp vs ScreenBoosterV2 with CNet Inpaint Model:
- Balanced mode: https://imgsli.com/MTkwNzc5
- Prompt importance mode: https://imgsli.com/MTkwNzgw
- CNet importance mode: https://imgsli.com/MTkwNzgx
All of the above applies to Original, Latent Noise and Latent Nothing. I tested it, because I wasn't sure. Also, the Inpainting checkpoint doesn't matter here.
Now back to the upscalers.
I have to throw another info block here - even though I'm only using the ScreenBooster V2 upscaler now, I still sometimes change the CNet mode when inpainting, or just turn it off to make some tests with Inpainting checkpoints or CNet's Inpainting model. This is the only point in the workflow where this is very important, because only in this case can we use specialized upscaler models. Other than that, they mean almost nothing - the difference is almost indiscernible - so when working with Tile Resample they mostly do not matter, except when you start lowering CNet's weight. The effect will be a bit bigger then, but still not big enough to make a difference. The biggest effect with Tile Resample is when changing to prompt importance mode, and I will repeat this quite a lot, because you might now think I'm contradicting myself by saying to use specialized models for upscaling when they don't seem to matter at all. They are still important during the hi-res. fix (if you use that), Ultimate SD Upscale, non-tile_resample workflows, and inpainting without ControlNet.
First off, you will need to add the img2img upscaler component to your quicksettings. Open your settings, find this list and look for upscaler_for_img2img. I usually just click Show All Pages and ctrl+f for things. Type quicksettings in there.
While you are at it, also add img2img_color_correction, you will need it later. What it does is attempt to correct the color tone of the masked image (of the outmasked area, or in general when using img2img). It's kind of important when inpainting on mixed contexts, like when you have a rock on the ground on a sunny day: SD will adjust the generated image colors so they match the temperature and color tone, or just apply a general correction so it doesn't look out of place. You won't need it most of the time, only when things go wrong. ControlNet also has a model for this, tile_colorfix, which does almost the same thing, but I can't tell exactly what the differences are between this model and the Color Correction option.
Settings -> User Interface -> Quicksettings list (dropdown list)
It is easier now to add components to the quicksettings, because we have a list of customizable components since 1.3.0 (I think... or maybe 1.2.0, not sure).
I don't use CLIP skip, because I don't need it. If you switch between anime/realistic models, you will most likely need to add this too: CLIP_stop_at_last_layers.
Click Reload UI, so your webui can insert the selected components.
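As a side note, the same options can also be flipped through the API (when the webui runs with --api), which is handy if you ever script batches. The option keys are the ones mentioned above; exact names can differ between A1111 versions, so check GET /sdapi/v1/options on your install first. A minimal sketch:

```python
# Flip the same options through the API instead of the quicksettings bar (webui started with --api).
# The option keys are the ones mentioned above; names can vary between versions, so check
# GET /sdapi/v1/options on your own install first.
import requests

URL = "http://127.0.0.1:7860"  # assumed local webui address

requests.post(f"{URL}/sdapi/v1/options", json={
    "upscaler_for_img2img": "4x_NMKD-Siax_200k",  # whatever upscaler name your dropdown shows
    "img2img_color_correction": True,             # the color correction toggle described above
    # "CLIP_stop_at_last_layers": 2,              # only if you switch between anime/realistic models
}).raise_for_status()

print(requests.get(f"{URL}/sdapi/v1/options").json()["upscaler_for_img2img"])
```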
First, open the Model Database for Upscalers.
Here's a list of my currently used upscalers when Inpainting (copying from older guide):
- ScreenBooster V2 - this upscaler really caught my eye, mostly because of how well it performs with rendered-like images or anything fantasy-related/stylized. This is a really powerful upscaler for Inpainting and I use it almost the entire time, but that's only because it fits my "style". It does not always perform well with photo-realistic images. This model was apparently trained to upscale Game Screenshots and this is a very important piece of information, because this includes game asset creation, game screenshot manipulation and rendered image enhancement. It somewhat gives the feeling that soft shadows (or black colors) are slightly more defined (the colors are more coherent(?) - is that a thing?).
- 1x_Plants_400000_G - performs really well with overall foliage, grass and flowers, so this is my preference when inpainting nature (depends on many, many variables, though). NOTE: its pre-trained model is 1x_NMKD-h264Texturize_500k
- Ground - I use this for inpainting any kinds of dirt/road grounds or even textures. It works best with any surface that is not grass/moss (but it depends)
- 1x_sudo_inpaint_PartialConv2D_424000_G - General purpose upscaler for inpainting very small objects in the distance. I sometimes swap to this model for background nature/landscapes
- Forest - can work well with any forest-like ground textures or just pictures, that contain leaves, branches, sticks. Also applicable when inpainting something made of wood.
- Face-Ality V1 (4x_Fatality_Faces_310000_G.pth) - I don't always use it, but I like to compare it to other models when inpainting the entire face - it heavily depends on the checkpoint
- 4x-Fabric and Fabric-Alt - this is a very nice upscaler when the inpainted area on any clothing is relatively small, so in comparison to other upscalers it won't do any better on big scale. It still performs well for background clothing/people
- NMKD Siax ("CX") - general purpose upscaler for those, who don't care. I sometimes use this for initial upscale or for testing
In addition, you should really just go through the Specialized Model section and look for any upscaler, that fits your needs.
Notice that the biggest effect of this step is visible between 2 and 4 upscales. For example, if you have a 1024x1024px face and you upscale it with a size of 2048x2048px, there might not be any change at all if you use other upscalers, like Skin vs Fatality Faces.
You can see this model being recommended everywhere, because people think this is the best upscaler for everything and it creates ultra high quality images. Well, that is plain wrong :)
For images viewed in raw scale, this model is really bad for general upscaling, but may be okay-ish for inpainting. In most cases this upscaler creates over-sharpened squiggly lines, so use better alternatives, like NMKD Siax (or Siax Superscale). It still won't make it look good, a tiny bit more noisy, but at least it won't be an over-sharpened mess.
As I already mentioned in the beginning, I don't make any kind of Anime stuff or NSFW. I switched to game artworks as NSFW is just plain boring and useless for me.
Models I use:
- A-Zovya RPG Artist Tools
- A-Zovya Art version (for concepts)
There are a gazillion similar and new models being added every day, but I don't really care. This is good enough for my style; all models are just similar to each other, many of them being just a merge of other models.
I also used ""realistic"" models in the past, but I prefer more stylized models. Realistic models are only good for the memes. You can use any model you like anyway. Just remember, that if you still want to use your Anime models with this guide, you will encounter issues mostly during prompting, as most anime models are trained with booru
tags, so prompting is a little different and the result images will also be different. It's just just the prompting anyway, the upscalers used will not even matter at this point.
This guide does not require you to use the same exact model, but it works really well if you want to create stylized artworks.
First trap you fall into is downloading Inpainting models, which might sound trivial, but they serve a totally different purpose.
Take the below base image as an example. I mask the entire tree in the middle of this picture and I want to turn it into a Tree House. I first want to make a test control sample (2nd image), which is just inpainting the masked area at 0.9 denoise strength.
The second image was generated on a regular non-inpainting version and without ControlNet's Tile Resample. You can immediately notice a big problem here: the entire shape of the mask, which was the tree, does not really fit the base image - compare to the 4th image with the Inpainting version of that model. Now, why is that?
That's because regular models are not specialized in filling the mask with regard to the background it is inpainted on - the background on regular models will be very generic or filled completely with what you prompted for, which will just fill the mask with your prompt in a way that looks out of place no matter how big the given context is.
Hold up, you will now say: hey, but your prompting sucks, that's because you have to be more descriptive, you masked only a small part inside of the tree!
Yes, the purpose is to show how they blend in. OK, let's try this with some random tokens:
masterpiece, high quality,
tree house, house in the tree, tree with a house,
soft lighting, detailed background, tree leaves, volumetric lighting, ambient occlusion
Comparison: https://imgsli.com/MTkwMzg1
Ignore the obviously visible mask - this is the effect of a poor quality base image with a complex background. Normally, you would inpaint the entire area to hide this.
- Wait, hold on, but you are trying to add something new, you are supposed to be using LATENT NOISE for that!
No, the shape remains the same, the edges are just way worse, mask blur won't even help you here.
Why are there two models then? To get to the main point: Should I use Inpainting model?
- normal version models are good for changing the masked area, which can be unpredictable; they are even better on super high denoise with ControlNet's Tile Resample in order to introduce a higher level of detail. They can still be used for adding new objects with Latent Noise, except you have to waste time on fine-tuning the settings, mask shape, blur, padding. Just not worth the time
- inpainting version models are specifically made for adding/removing objects. They can still be used in regular inpainting, but they tend to produce softer color palettes and slightly more blurry results (no, sharp upscalers and tile+sharp won't help you here)
To give you a way simpler explanation, I will show you a very stupid example, like inpainting a cliff into the grass with a river and a bridge. This will also be a plot of two models over a denoise range to show you how it changes. I picked 0.65 denoise as a starting value, because in some cases this is where the big image changes can start appearing, depending on the prompt and the original content. It's mostly 0.5, 0.65 and 0.75.
If your masked area has some details, complex shapes or anything resembling the prompt in some way, changes will appear very early in the denoise range! On lower denoise the bad background blur won't be visible, obviously, but then the inpaint won't fully represent what you are trying to achieve.
This is the image I will be using as a base, and how I will be masking it.
Now, you might say: hey, that's completely fine, both are nicely blended, what's your point? What about lower denoise anyway?
No, they are not nicely blended. In a real world example, a regular model will always give away where the mask edge was, which is completely unacceptable by my standards. On a smaller scale you won't see the difference, but that obviously is not the point. The Inpainting model will also create softer images, but inpainting models
Here's the thing: in Stable Diffusion, everything depends on the context (and how big it is), settings, source image and tons of other tweaks. I won't tell you I'm 100% sure that with setting [x] this thing will be the best, it will work like this, you need to use [y] this way. People who tell you that definitely didn't experiment enough to confirm it. I spent many hours plotting and fine-tuning - I still can't say if there is an absolute sweet spot with certain settings that will work every time. That's not how it works here 👎
For example, I always use DDIM / 46 steps / CFG 5 / 1280x1280px mask (sometimes 21 steps for a speed inpaint), because this is just my preferred way + I'm on an RTX 4090, so such a big mask takes ~12s to generate with CNet on the fastest sampler. These are not the absolute recommended settings, just preference.
Here's one for you: can you tell which one of the below images was made with Inpainting model?
Prompt: frog sitting on a rock
And here's a more extreme example: massive rock in the grass
If you can tell which is which, you can probably also immediately tell why at this point.
Judging by those images, you can also probably tell that all of the above makes no sense at all, because it doesn't seem to be true. So, now, as an exercise, take any image, get both the Inpainting and regular versions of the same model, and try adding something completely new to the image. Well, if you paid attention...
This is more of a disclaimer - I'm not telling you that, according to the above observations, you should only use Inpainting version models when adding/removing objects. You can just experiment on a regular model with different token weights in your prompt and denoise strength and you will be just fine, except you will have a bit less control over the masked contents. Keep in mind that regular models tend to fill your masked area with less accuracy, meaning the blend will be less coherent than what you would get with Inpainting version models. The Inpainting version helps with a more coherent fill, which should be accurate in the majority of cases as long as a very high denoise is being used and you are experienced enough to tell how Stable Diffusion will interpret your prompt with the provided settings.
There is also an alternative approach to this problem, but it's a bit different.
You probably noticed, that ControlNet also has an Inpainting model. What's up with that?
Important: do not treat this model as a must-use together with the Inpainting version of the checkpoint!!! They should not be used together, otherwise you might get totally unexpected results - the only reasonable combination is when Outpainting (both can work, but that depends on how complex the base image is; you might get a visible edge on regular checkpoints).
Fortunately, ControlNet comes to the rescue with its fine-tuning model for regular, non-inpainting versions, but, unfortunately, it requires way more tweaks.
Here's a "plot" of ControlNet's Mode of a non-inpainting version model, where the Masked Content
is marked as Original
and Latent Noise
.
The image had enough context to represent the background; I deliberately chose just a part of the image, not the entire tree, to show how Stable Diffusion handles the content blend. A more appropriate example would indeed be masking an entire tree, but with this it's easier to show the difference. For a proper example, scroll up to the beginning of this section.
This obviously is not a final step, you would reiterate on this image with Tile Resample to get a more coherent blend, but that's not what I'm trying to explain here.
This will obviously, and again, depend on all of your settings and image complexity - I will keep repeating this a lot.
I noticed that, usually, ControlNet's Inpaint will retain the shape of whatever you got under the masked area without ControlNet, assuming the same seed was used.
Not an accurate description, but the difference between ControlNet's Inpaint and the Inpainting version of a model is that the Inpainting version resembles the original somewhat more than what you would get with CNet's Inpaint, which in turn gives a more coherent result than a plain non-CNet, non-Inpainting model.
To make it a bit more accurate, let's say you generate two images with high denoise (at least 0.86):
Assuming [i] is just some Base Image. The following will be different for different kinds of inpainted images and prompts! Keep that in mind.
- on non-inpainting version model, you get [x] result
  - CNet's Inpaint on non-inpainting version can be similar to [x], but [i] may get lost
- on Inpainting version model, you get [y] result that will be similar to [i]
  - CNet's Inpaint with Inpainting version model can be similar to [i], but [y] may get lost
I will try the same test, but on a smaller scale to see if there even is any difference.
Masking a small area, prompting for crab in the sand, again the same test on a non-inpainting version model with/without CNet's Inpainting, then with the Inpainting version model and with/without CNet's Inpainting.
Again, Inpainting version model with CNet's Inpaint kind of works, the similarities are just a bit different than in the above explanation:
- second image has a clearly distinguishable retouched area
- third image is now more coherent, second got lost, retouched area is less visible
- fourth resembles first, retouched area is least visible, a bit fixable
- fifth had its composition changed, first and fourth got lost
This is why, in the AI world, everything will be different. It can be similar, but never as you would expect it to be. The results will almost never meet your expectations, because everything is just so unpredictable. This is also why you should never expect things to go smoothly when you are following tutorials! Even I still get many bad results after spending lots of time on something, as someone who has spent nearly 10 months in Stable Diffusion.
All extensions are available inside of AUTOMATIC1111 and I only use a handful of them. In fact, you only need two, nothing fancy is required.
The main extension is Canvas Zoom, as this allows panning/zooming/changing the brush with shortcuts.
The other most important extension is obviously ControlNet, but I only use one of its models (95% of the time); the others are completely unnecessary in this workflow, so I only go with Tile Resample.
The only thing you need to know about this extension, is this:
This extension is a little buggy and annoying; it can take some time to get used to it.
To make it as simple as possible, you can think of Tile Resample as: take my image and generate something new, making it as close as possible to my input. This is one of the big secrets of super-detailing. In combination with the Detail Tweaker LoRA and the correct settings for the source image, you basically resample the image in a way that introduces a way higher level of detail than with any other tool.
In this process, you will always want your ControlNet to be empty to automatically pull the input image.
You also will work with at least 0.85 denoise with this model, as this will not change the image at all (depending on the weight).
Since ControlNet does not seem to automatically download this model, below are the links for both config and the model
- model: https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11f1e_sd15_tile.pth
- config: https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11f1e_sd15_tile.yaml
Model goes into: your_sd_directory/models/ControlNet
Config goes into: your_sd_directory/extensions/sd-webui-controlnet/models
I think the config might already exist, but I'm leaving the link just in case.
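If you prefer grabbing the files from a terminal instead of the browser, here's a minimal Python sketch (the resolve/ URLs are just the direct-download variants of the blob links above; the target folders assume the default A1111 layout mentioned above, with your_sd_directory being a placeholder):

```python
from pathlib import Path
from urllib.request import urlretrieve

# Direct-download variants of the links above (blob/ -> resolve/),
# mapped to the folders they should land in.
files = {
    "https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1e_sd15_tile.pth":
        Path("your_sd_directory/models/ControlNet"),
    "https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11f1e_sd15_tile.yaml":
        Path("your_sd_directory/extensions/sd-webui-controlnet/models"),
}

for url, target_dir in files.items():
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / url.rsplit("/", 1)[-1]
    if not destination.exists():  # the config may already be there, skip it then
        urlretrieve(url, destination)
        print(f"downloaded {destination}")
```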
Please take a second and look at the ControlNet options, as they are important. They mean exactly what they say in the form, so you can already deduce how they will affect your results.
The only options you don't have to care about with this model are Starting/Ending Control Step and Down Sampling Rate. I never use them anyway. Down Sampling Rate is not even documented, but from my understanding, the difference between generating with Down Sampling 1 and 8 is that with a value of 8, images have their darker colors generated with more strength. It's not a massive difference, but to some degree it exposes the Black/Shadow more, so I'm blindly using this value - don't follow me on this.
Below will be a small comparison of the ControlNet modes to the Base image
- My prompt is more important - shifts attention more towards your prompt, but still tries to keep whatever was under the mask (or in the input in general) - almost never used by me
- ControlNet is more important - makes sure the overall composition will be as close as possible to the input. Warning: this might not work well with realistic models, images may appear "cooked", or over-sharpened, which will ruin the whole effect. This is kind of similar to what the Balanced mode would look like on low weight - rarely used by me
- Balanced - a mix of the above two. Useful most of the time - I use this like 95% of the time
Open the image in new tab for bigger resolution.
Disclaimer: the information below has to be taken with a grain of salt, because it highly depends on the complexity of your image, all internal settings, upscaler model and upscaling settings. While I still recommend a maximum grid of 4x4 big tiles, this can still work on smaller tile size!!!
Also a side note: in my workflow, I'm not using this feature to "introduce better upscaling". I'm intentionally using it before the actual work to check what artifacts I get in the form of new objects or shapes. Since I do landscapes 90% of the time, this is not a problem at all. When you upscale characters like this, you will get artifacts in the form of new people on people (when you upscale big images with small tiles), but that's not my problem :)
I didn't really plan on covering this, because it's not really needed in my workflow, but it's good to know this extension. Sometimes, I use it with even higher denoise to eventually get something new I can inpaint later. I never upscale the final result, because it introduces new artifacts back in the form of a jumbled, blended line mess and noise, which defeats the entire purpose of this process. This applies to 99% of the AI-generated images on the Internet. It just looks bad.
This extension is actually a trap, because you cannot upscale infinitely. Why?
While it is partially true that you can upscale images infinitely with this extension, it comes with a big cost. Now, you might think: what cost, if there is no VRAM limitation? Actually, there is. You see, there are tutorials that explain how to upscale images to 8192x8192px. They seem to ignore the most important thing here: the higher you go, the less context Stable Diffusion has to work with.
Take a look at how big the default tile size (512px) is if you were to upscale a ~3000px image twice (the white box):
I think you already have a pretty good idea why it's going to be bad - that is, if you were paying attention to the Whole Picture vs Only Masked section. Stable Diffusion simply does not have enough context to properly upscale this image.
And again: why?
Check this out first. I have an example image of 3172x3172 pixels, scaling twice on default settings (512 tile size).
Each upscaled tile is 1/169th of the entire area!!! Stable Diffusion will try to resample this image, but outside each tile's boundaries there is way more information, so you are simply risking each tile being unrelated to its context.
Not in its entirety of course, even on high denoise. I mean, mostly Stable Diffusion can interpret something in the image as a different shape based on some part that unfortunately got in the way. An example would be a blended background having some sort of a city very far away with only a small part of it visible. It could be changed into a mountain, and on the next tile it could be interpreted properly as a city. This is obviously an extreme example; mostly you will get "something from nothing", meaning you can randomly get a bird from a certain shape in the water.
Now, you could just say: yeah, simply lower the denoise. Yes, but that's not the point. This is almost the same as just upscaling with Extras or Tile Diffusion, unless you fiddle with your settings, tile size, etc. You won't really introduce any detail at all - assuming you just upscale to have a big image and move on. Yes, there is the add_detail lora, but its effect really is greater on higher denoise with Tile Resample. In general, upscaled images on low denoise will just become blurry. Then again, you might say: well, just use Tile Diffusion, what are you doing? Its results are not that much more impressive either, and I'm saying it as someone who's allergic to unnatural shapes on AI-generated images and obviously visible blurry noise at raw scale.
Enough sidetracking.
Now some ControlNet Output:
Canva size: 4096x2048
Image size: 1024x512
Scale factor: 4
Upscaling iteration 1 with scale factor 4
Tile size: 512x512
Tiles amount: 32
Grid: 4x8
The key part of the information here is the Grid. This is important, because you actually have to pay attention to what your tile size is in relation to the calculated target size.
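If you want to sanity-check those numbers yourself, here is a minimal sketch of how I assume the grid is derived (simple ceiling division; it reproduces the 4x8 / 32-tile output above and the 1/169th figure from the 3172px example):

```python
import math

def tile_grid(width, height, scale, tile=512):
    """Rough tile grid for an upscale: ceil-divide the target size by the tile size."""
    target_w, target_h = width * scale, height * scale
    cols = math.ceil(target_w / tile)
    rows = math.ceil(target_h / tile)
    return rows, cols, rows * cols

print(tile_grid(1024, 512, 4))   # (4, 8, 32)   -> the 4x8 grid / 32 tiles above
print(tile_grid(3172, 3172, 2))  # (13, 13, 169) -> each tile is ~1/169th of the image
```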
So, really, the highest you can go is the maximum image resolution your GPU can handle x 4 (preferably 3). For example, if your GPU can do 1024x1024px, the maximum image resolution you can upscale to is 4096x4096px, because above that you may start running into problems.
This obviously depends on many, many different things:
- the complexity of your input image
- ControlNet mode / weight
- CFG
- denoise strength
- internal settings of your sampler used (*)
- prompt - it's okay to keep the prompt if the image has very few subjects, which are present in almost all calculated tiles, like Sky + Ocean, but it depends on the input resolution. If both subjects fit into the entire tile, e.g. doing a 4:1 aspect ratio, it's completely fine if your prompt is not empty and you have something like `cloudy sky, dusk, ocean`. With this, the chances of getting something weird are smaller, because there is enough information for Stable Diffusion to properly resample the tile.
(*) - yes, this matters. Notice that DDIM has a noise multiplier, as do Ancestral samplers; Karras samplers can be tweaked, UniPC can also be tweaked and there are sigma tweaks. Normally you wouldn't touch this, but if you like experimenting, it's completely fine. I sometimes play with the DDIM noise multiplier and the overall img2img noise multiplier, but only to do some comparisons, not to "improve" the actual workflow.
Comparison of 512px tile size vs 1024px on 4:1 - https://imgsli.com/MTkwMTk3
Notice how unlikely it was on this ratio to get something off or visible seams (you can still see them). This was also mostly thanks to the smaller scale. If this were an actual 8096px upscale, I would start getting visible seams and discolored tiles.
Here's another comparison, where Stable Diffusion clearly does not have enough information in some tiles - there simply is too little context, and in some unfortunate tiles it will create visible seams or weird shapes: https://imgsli.com/MTkwMzYy
I pointed out the most obvious errors you can immediately spot. Now, the most interesting part is the bottom-right.
The left part is just a 512x512 piece of the image upscaled with a 1536x1536 tile size.
When Stable Diffusion upscaled that part with 512px tile size settings, the water from the original image was just so blurry, that it interpreted it as... looks like melting penguins to me. This bottom corner is special, because this is just the bottom half of the tile being processed, which is the leftover from padding settings - yes, padding is processed later, as padding remains unprocessed until the entire image is done. Padding is helpful for context and in combination with mask blur, it reduces the seams (they still appear, but less, which tells you that you are on the right track with tweaking).
Having the second full image comparison in mind, you might argue that a seams fix would... fix this issue, right? No :) This is a separate process, which resamples the tile overlap, and it does not guarantee a seam fix at all.
I will keep repeating the same thing over and over: there are no perfect settings, you will experiment by tweaking and fine-tuning everything until you get what you want. Stable Diffusion is NOT about do this and immediately get that. If you get baited by tutorials saying: this is finally the one and only, ULTIMATE solution to [x] - it is not. Trust me.
For me, the settings that kind of did work on big images come down to cranking the tile size to at least 1/3 of the image size on a certain axis, as I already explained somewhere above, and depending on the background. I almost never go above 2048 pixels anyway. I also noticed that if you really have to use a smaller tile size, use 768 pixels with maximum padding (128px). This will sample 512x512 tiles with 128 pixels of context around the tile. The same applies for 768px tiles: you switch to a 1024px tile size with maximum padding, but the mask blur might require some fine-tuning, from 16px to 32px. The recommended settings in the Ultimate SD Upscale extension's repo don't help much; you need to experiment yourself, as every image will require different settings.
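To make that tile/padding relation a bit more concrete, here's a tiny sketch of how I think about it (the effective content per tile is just the tile size minus the padding on both sides):

```python
def effective_tile_content(tile_size, padding):
    # The area actually resampled per tile, with `padding` pixels of context on each side
    return tile_size - 2 * padding

print(effective_tile_content(768, 128))   # 512 -> 512px tiles with 128px of context
print(effective_tile_content(1024, 128))  # 768 -> 768px tiles with 128px of context
```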
The below information is subjective!
This does not really belong in the Prerequisities section, but there's just one small piece of information you need to know: there is no best sampler, it's all subjective and requires TONS of plotting on your own prompts and settings. Every sampler will give you the same level of detail (subjectively) when fine-tuned. For example, I always (and only) use DDIM, because it's the fastest sampler across the board both on f16 and f32 above 75 tokens.
That's all that matters to me, I don't really care about the convergence, but on a weaker machine, this is definitely something to look into, like lowering to 15 steps, since DDIM, for example, converges around 8 or 10 steps. Above this value, there is just a minor detail improvement and composition change, so if you don't care about the speed that much, a higher amount of steps won't hurt you if you really need to fine-tune even more.
I use DDIM / CFG: 5 / Steps: 46 (or 21) - this seems to be a sweet spot for inpainting (for me at least). I'm always using 46 steps with the special option: With img2img, do exactly the amount of steps the slider specifies. I never go below 0.75 denoise anyway, so with default settings and my old Steps count (60) SD was always doing around 45-60 steps (I change denoise sometimes). Since the change above 45 steps is very minimal (even below this value!), it's just better to use fewer steps, and I just want to use a constant amount. It doesn't matter that much anyway, but when generating bigger images + ControlNet (sometimes with two units), time matters, since ControlNet adds to the generation time - in addition to the image size, this can add up quickly. For example, my masked areas are usually around 1.5-2 megapixels, which in most cases means the masked area gets upscaled 2-4 times, depending on the masked subject. I change aspect ratio a lot, so it's hard to tell exactly, but I won't do this in the workflow documentation.
Denoise is a multiplier of your steps. The option mentioned above should be used if you work on lower denoise, because there is no need to do e.g. 60 steps on 0.1 denoise.
If you are using for example 100 steps on 0.1 denoise, by default, SD will only perform 10 steps. Some rounding is applied, so it might do -1, so that's why I always add +1 on a DDIM sampler.
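As a rough sketch of that steps-times-denoise behaviour (my approximation of what A1111 does, not its exact code), this is why the special option and the +1 matter on low denoise:

```python
def img2img_steps(steps, denoise, do_exact_steps=False):
    # With the "do exactly the amount of steps the slider specifies" option,
    # the full step count is always used.
    if do_exact_steps:
        return steps
    # Otherwise roughly steps * denoise are performed; rounding can shave one off,
    # which is why I add +1 on DDIM.
    return max(1, int(steps * denoise))

print(img2img_steps(100, 0.1))        # ~10 steps actually performed
print(img2img_steps(46, 0.75, True))  # 46, the full count
```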
But you know what? You can forget about the above information anyway, you will be perfectly fine by using something like this:
- DDIM / Steps: 8 / CFG: 5
- DPM++ 2M SDE Karras / Steps: 6 / CFG: 4 (this is not a multi-step sampler, so it won't be slow)
This is just one of the examples, I use different settings, because it's fast on my machine anyway. Use whatever is your preference.
There is some more in-depth info if you want to just check how other samplers work, but I don't rely on this information.
The speed limitations also do not apply to the PLMS and UniPC samplers, but I don't really use them, because I'd need to change the settings to match the quality. Switching to PLMS would require changing my usual CFG and amount of steps, which I haven't really tinkered with in comparison to those samplers. All I know is, that I won't be using other samplers, because on RTX 4090 they are ~40% slower above 75 tokens and if I ever have to use a longer prompt for some reason (mostly embeddings), there will be no sense for me to change the sampler, because the speed matters in such workflow and I will just be wasting my time on no quality upgrade (I'm on SDP+upcast so I'm always going full speed no matter what - I don't need the upcast, I was testing stuff in the past so I just left it there as there is completely no difference in speed/quality). I don't even care other samplers create a different composition if the level of detail is the same when fine-tuned ¯\_(ツ)_/¯
For example, check how the fastest samplers react to specific CFG scale/steps, which clearly will tell you why plotting just samplers for quality comparison on the same settings makes no sense and should be never used to judge the sampler quality:
There is a very interesting observation: since those samplers are the fastest on long prompts and UniPC works with a very small amount of steps, you will notice it can "kind of" give you a concept art-like style, but it's very hard to tweak, because you need a low CFG, which basically means SD will do whatever it wants.
I used the following settings for UniPC (ignore the order, it's on the default 3) and I think it was 3 steps and 1.5 CFG.
The quality obviously is super low, but it's a very efficient idea printer.
Anyway...
Benchmark speed proof with cross attention optimization method plot of all samplers: https://gist.github.com/DarkStoorM/3c10c0c027ef1c90c230ffa0cc39213c
Since most people use lots of super long positive and negative tokens, which exceed the first chunk of 75 tokens (they do absolutely nothing in the negatives anyway on certain samplers, like DDIM, they just get ignored), pay attention to the float 32 OFF above 75 tokens section. This obviously depends on the machine, so don't take these as must-use settings.
Although, speaking of other samplers, there is a very important part you need to see for yourself (open in the new tab):
Samplers marked with M are multi-step samplers. Don't use them, they give you no benefit, no quality improvements. You will only be wasting your time, because they are twice as slow as regular samplers.
I grouped all samplers according to the composition they create. DPM Fast and DPM adaptive are the only unique samplers that create something somewhat in-between, like a cross-composition result. Don't quote me on the last sentence, though. This is just a random observation; all samplers can create an almost equal composition when fine-tuned.
Everything that you read here (or skipped) was based on my months of experience with Stable Diffusion and some "for fun" testing. By the way, if you ever have any question about how many steps to use, how big the CFG scale should be or which sampler to use with certain settings, there is a very simple answer for that: X/Y/Z plot.
Let's say you want to find a sweet spot for a couple of samplers: DPM adaptive, DDIM, DPM++ 2M Karras.
Make sure to tick Include Sub Images if you want to just grab the single image you like from the plot. They will appear in the output gallery.
Refer to X/Y/Z Plot Syntax for more info.
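For reference, this is the kind of setup I mean - axis names as they appear in my A1111 version, the values are just an example, and the a-b (+c) range syntax comes from the plot's own docs:

Script: X/Y/Z plot
X type: Sampler - X values: DPM adaptive, DDIM, DPM++ 2M Karras
Y type: CFG Scale - Y values: 3-9 (+2) (expands to 3, 5, 7, 9)
Z type: Steps - Z values: 10, 20, 30, 46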
This is all just subjective information, because it heavily depends on your prompt, LoRAs and Embeddings used, so first you would finish your prompt, generate an image and if it looks OK, then you do the plot, because changing just one token can result in a different image.
Here's a full X/Y/Z Plot (you don't need to use all types, even just one is fine if you just want to plot CFG, for example). With this, and if you just do txt2img when you fish for a base image, you can pinpoint the image you like just by looking at the smaller scale, searching for interesting composition changes or just some details. Sometimes the image won't change much, so you might need to adjust your prompt. In the "fishing" phase you don't care about the quality of the image, but about what you got in the batch generation, and then you go from there.
If you skipped all of the above, I won't be explaining what all the selected options do and why at all - but maybe a little (when I get to the workflow itself). The purpose of this workflow is to introduce you to Super Detailing and Subject-Oriented Inpainting at high resolution with most of the presented information in this document, although I'm not using all of the inpainting features in one artwork.
By high resolution, I don't mean like actual 8k, because it doesn't make sense and can be extremely time consuming and just too annoying, but it's definitely applicable.
This does not explain how to create any artwork of any theme - you are on your own here
"I can do way better images in MidJourney!" - well, you do you, I'm very harsh when it comes to the quality of AI-generated images.
Speaking of other tools, why am I not using InvokeAI, ComfyUI or SD.Next (Vlad's fork)? Well, it's pretty simple: preference and muscle memory. I already got used to A1111 + Photoshop, so I don't need other tools, that let me do basically the same thing, so it's just not worth moving to something different. I have a very specific workflow and style, so switching to something else would just be a waste of time. I'm not even mentioning MidJourney, because it's completely useless in my case.
Sidetracking a bit here, so: the entire workflow is way too complicated for a regular AI-tools user, because it's a very long and painful process of regenerating basically the entire image from scratch not once, but twice along with applying many corrections.
So, what does my workflow actually look like? At this point there are already like 10000 words of me rambling with basically no concrete information, so let's get on with it.
There is actually a minor split in the workflow, which only depends on how I start working:
If this doesn't make any sense to you, here's a more thorough explanation:
Method | Description |
---|---|
IDEA | This is the starting point, which is not about mindless token spamming in txt2img hoping to get something (there's a separate, lazy phase for that). This is a really nice phase to force some creativity rather than brute-forcing composition. Sure, Latent Couple could work, but still, you won't get exactly what you want at the right place. I'm trying to make a very rough drawing of my idea in Photoshop on a 960x576px canvas (the size has to be divisible by 64 for compatibility purposes). This is important, because when re-generating it at 2x size (1920x1152px), there is a big chance of altering the image in a way that I get something new, but the composition remains unchanged. |
TXT2IMG (if lazy) | This phase is even more annoying than drawing, because with this you won't be able to generate what you want, so it is only useful as an idea/base printer or for generating images for inpainting practice with no specific composition in mind, so I'm almost never using this when actually working on something new. |
IDEA2IMG | This is my main starting phase, where I draw things as I imagined them, at the right spot, with the right size, everything according to the idea. I have full control over everything, so this is the most ideal solution for consistent composition. I like this a lot, because SD won't really interfere with the overall idea, just fine-tune it. A simple example: I'm drawing a grass field extending into the distance with randomly colored shapes. At this point I expect something new to be inserted there and it is completely fine. Drawing "noise" helps with getting new objects, where solid colors will most likely ignore the area - assuming Tile Resample is enabled on Weight 1 (or <= 0.5 if the drawing is too simple). |
IMG2IMG post-drawing | When the drawing is done, I move it over to my SD and re-generate the image on 0.9 - 1 Denoise with ControlNet + Tile Resample enabled on Balanced mode. Balanced mode is important, because both prompt importance and CNet importance are too strong to use with simple drawings - assuming there is somewhat of a level of detail present (exception: no details on the drawing = don't use Balanced), but it depends on the complexity (CNet importance never works on simple drawings unless you change the weight to <0.5). I do almost everything on Balanced mode, then on prompt importance if it doesn't go well. The image is re-generated with the resize by 2 option, and the img2img upscaler can be used to force a slightly different generation at this early phase, because the entire image is being rebuilt (only on prompt importance with Weight 1). I still experiment with the initial generation, because it will behave differently on every new image, so there are no perfect settings. |
USE NEW IMG AS REFERENCE | When the "1080p" image is ready, I look at how the overall result turned out. I check what new objects have been added to the scene and in which place, then if I like something that appeared on the image, I go back to my drawing and adjust it by redrawing stuff according to the generated image and correcting the lighting, sometimes the colors too. I repeat this process a couple more times, occasionally increasing the drawing accuracy for better results. |
INPAINTING (first pass) | The actual work time. Starting from Background to Foreground, I mask bigger areas describing one subject, e.g. the sky, mountains, forest, other big objects. With this, all subjects are resampled one by one with the correct scale and with different settings (at most a 3-4x upscale). If you remember from the document, SD can fully focus on one subject, so that's why inpainting will be more accurate with the details. In combination with the Detail Tweaker LoRA, this is a game changer. |
QUALITY INSPECTION + correction | This is a mixed phase with INPAINTING, where I look at the overall quality of the image after the first pass, adjust the highlights here and there or fix the image by repainting, play with the colors and lights if there are any actual sources, like campfires or torches, which need to illuminate the surroundings. When the first pass with the quality check is done, I throw Photoshop's Camera Raw Filter in and play with the settings just to see if the initial correction is actually needed. I sometimes just apply a random filter if it fits. |
INPAINTING (second pass) | This is the more complex part, because big subjects have to be split into smaller areas if possible, then resampled. Already detailed images will get an even higher level of detail, which is the actual black magic of this workflow. In this phase, I sketch tiny details on the subjects or do some fine-tuning in order to get a bit better generation. For example, if there is a tall cliff with poor details, I will just draw some random cracks with highlights/shadows or even random lines and SD will be a bit more accurate. This manual guidance is the key to the higher level of detail. |
QUALITY INSPECTION | The second pass of quality inspection takes a bit more time, because now I have to focus on smaller areas and look for more unnatural shapes I forgot to inpaint or ones that just got skipped for some reason, so I can re-inpaint them later. Think of this as Quality Assurance. Double-checking the smaller areas, looking at the light sources again. If lighting or shadows were not corrected in the first Quality Inspection pass, they might become unfixable and it may be required to loop back to the first Inpainting pass if a bigger area needs to be reconstructed. |
RE-INPAINT if needed | In this very late stage, I look for some minor improvements that can be made, like changing/removing some objects completely if I don't like the composition, or just test stuff with further resampling. |
MANUAL COLORWORK | Final phase of color-tuning on the scene, mostly playing with color overlays on light sources and surroundings, shifting tones. |
CAMERA RAW FILTER | The final step is always the good ol' Camera Raw Filter in Photoshop. A couple weeks ago (somewhere around May 2023) Adobe updated the filter with built-in presets, which work really well on photos, but some of them are really good for artworks, like Cross Process, Turquoise & Red and Red Lift Matte. Their default settings are nice, but I always play with the settings afterwards anyway to boost the colors. |
And, that is it. This is the entire workflow, which for regular AI users is just way too overcomplicated, but hey, that's just my preferred way of working. I want to squeeze as much detail as possible.
Have you noticed something in this workflow? Is something missing? Yes, it's a step 99.99% of people do with their images. It's the final upscale.
I never upscale my final image, because it just introduces the random noise back and distorts the edges. It's only good if you intend to print your stuff or if you don't really care about the actual quality. Other than that, it just makes an unnecessarily huge image (in megabytes, not dimensions), which most services either don't accept or have to compress down. Don't get me wrong, they can look good, but there's no point in doing that if they are going to be viewed at the same size as before the upscaling, or lower. Again, that's only unnecessary noise, which in effect is almost the same as over-sharpening in Photoshop, which defeats the whole purpose of this process (examples in the Conclusion section).
Below are three versions of my image from an older workflow (I like that image, just wanted to share it, don't judge me):
- Final image, no upscale
- Regular upscale through Extras with 4x-UltraSharp (imgur didn't like the 10000~px image, so it got resized to 5000px. Sad)
- Ultimate SD Upscale (4x-UltraSharp)
The incorrect aspect ratio on all of them is the effect of manual resolution change -> upscale, which is to be expected when using a fixed, custom aspect ratio like 1080p or higher that is not selectable in the webui (divisible by 64), or when upscaling through extensions, which can slightly change the resolution.
I deliberately chose 4x-UltraSharp here, because everyone uses it (we all know its effects, but it's just to show how it looks in raw scale). Again, don't get me wrong, it can look good, but upscaled + previewed at a lower scale is no different - there's like a 0.05% change, so a minor difference in sharpness, which I can also just add through Photoshop - ignoring Imgur's compression artifacts from the above examples. I see absolutely no point in upscaling the image.
So, that was it for the workflow. Now I will document the entire process with a little explanation.
I will be posting direct comparisons from Imgsli instead of images side-to-side, so you can have a better preview of the before and after images.
If you read the first half of the document, you know what Latent Couple masking looks like:
This is a photoshop mask (rough approx.) just to show how I mask objects during inpainting
This is pretty similar, except you just have full control over the entire scene. This workflow shows how I start inpainting my images. All of this is just about masking specific areas which only include a certain subject, like on the image above: brick wall, cactus, torch, crypt, ruins - however you prompt them. Depending on how many big areas there are, I sometimes do all subjects in one pass.
The process goes like this: mask, generate, repeat, correct in Photoshop if needed (or any other software of your choice). All it does is upscaling the masked area a bunch of times (preferably max. 4) and in combination with CNet's Tile Resample and high Denoise, it will generate an actual highly-detailed image.
I really recommend using at least 0.9 denoise with Tile Resample, but I don't know if it works the same with other models. You will change this value sometimes, but only within the range of 0.7 - 1. There are some situations where you will have to lower the denoise, like when SD takes too much color value from the context. You might have to play with denoise, but I would prefer forcing the color through the prompt.
This is territory where lots of experience is needed (speaking of dynamic masking). Half of the success is visually approximating how big the masked area is, to change the target size accordingly.
You don't have to be super accurate with this. Let's say you are inpainting a 1024x1024 image and your masked area is around 256x256, which visually is 1/16 of the image (1/4 of the row/column). You immediately know you can just upscale it to 1024x1024. As I already said, it's totally fine if you accidentally upscale 5 times, but it can be bad in some cases - blurring your result images even more.
The worst effect of this shows when inpainting very small objects. If you are masking, for example, a 128x128 area of trees in the background and upscaling it to 1024x1024 (8x), the image will be even worse - super blurry.
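A rough sketch of how I eyeball the scale - the max_side cap here is my own arbitrary illustration value, not a rule; the point is just "roughly 2-4x, capped so tiny masks don't get nuked by an extreme upscale-then-downscale round trip":

```python
def pick_upscale(mask_w, mask_h, max_side=1536, max_factor=4.0):
    # Cap the factor so small masks don't get blurred by an extreme
    # upscale-then-downscale round trip.
    factor = min(max_factor, max_side / max(mask_w, mask_h))
    return round(mask_w * factor), round(mask_h * factor), round(factor, 2)

print(pick_upscale(256, 256))  # (1024, 1024, 4.0) -> the 1/16th-of-1024px example above
print(pick_upscale(128, 128))  # (512, 512, 4.0)   -> stays at 4x instead of a blurry 8x to 1024px
```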
Detailing example:
So, why does this work if upscaling images introduces artifacts all the time anyway?
Why can't I just upscale the image and downscale if the process is exactly the same?
I think you already know why. You upscale every subject differently. There it is. This is the whole secret. Every subject/area needs to be upscaled with a different scale value or different settings. Sure, you will introduce artifacts with this, but when they are downscaled back, the artifacts are almost gone and the level of detail remains almost the same. You just can't achieve this through upscaling by extensions, but it is important to experiment with the amount of scaling on every subject. If you scale it too much, like 128x128 -> 1024x1024, the generated detailed image will be downscaled so much that all of the information will just get lost, the result will be super blurry, and you would have to apply manual corrections in your image editing software.
Below is a quick example (I use the TAESD live preview method and the big preview will be inaccurate, but that's because I don't want to use FULL and block the generation - I don't need full quality previews).
TAESD is a micro-optimization if you have frequent UI updates on live previews. TAESD is a low-accuracy preview, but it's fast and SD doesn't have to waste one step on rendering the image, making itself a bit faster. You can change this in Settings > Live Preview.
image. There is only one small problem here. SD has a context of the background and since this is an irregular mask shape, you can probably deduce why the tree turned green. This is one of those mentioned situations where you have to lower the denoise even with Tile Resample, which allows using Denoise of 1 to retain the shape, but I can't tell the exact value, because this depends on the image. You could also force-prompt for (yellow tree:1.5)
to switch the attention, which will actually make it brown.
This is a very short TL;DR of what you can do. There are way more steps in my actual workflow, because I use Photoshop to move images back and forth. I don't use Photopea extension, it's way too buggy and broken on new Gradio versions. Not gonna bother with breaking extensions on each update + Photoshop's Raw Filter is a super powerful post-processing tool anyway.
I rarely use txt2img to get a base, because I mainly work on landscapes and I prefer creating something, that actually is super close to what I imagine, not something lazy from regular prompts. The only exception when I skip the drawing part is if I get a scene from Unity (I got a bunch of Nature assets for scenes), when I'm actually feeling lazy or I just want to practice inpainting.
I initially wanted this to be an interactive guide, but with things like Stable Diffusion it won't make sense, because most of the time your results will be different than mine and it will just be frustrating, so I will just go over my entire workflow to show you how to mask properly, how to do dynamic contexts and how to adjust the target image size.
I usually start with half the size of 1080p, 960x576px - you can't actually generate a 540px-tall image anyway, because it gets rounded when upscaling for some reason, so I just go with this size and later crop the height from 1152px to 1080px.
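The divisible-by-64 rounding I keep mentioning, as a quick sketch (540 rounds up to 576, and doubling that gives the 1152px I later crop back to 1080p):

```python
def next_multiple_of_64(x):
    # Round up to the nearest multiple of 64, which SD-friendly resolutions need
    return -(-x // 64) * 64

base_h = next_multiple_of_64(540)  # 576 -> I start on a 960x576 canvas
print(base_h, base_h * 2)          # 576 1152 (generated at 1920x1152, cropped to 1080p)
```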
I will go with the following composition: I want there to be some sort of a watchtower "inside" of a mountain with a tall bridge (probably not going to look like a bridge at all), maybe a river in the middle and some land. I'm not sure about the mood yet, but I will go with my regular theme: dusky sky, which I rarely get right and which most likely won't look like that. Maybe SD will be able to handle it. The rest I can probably improvise after seeing the first results.
If you really want to follow along, just to see what it feels like, I assume you already have Photoshop. Otherwise just read through, maybe you can pick up some new stuff, but I would recommend against it, as I tend to be very chaotic with my work.
NOTE: the quality of the resulting image will be lower than expected, because it takes time and I will rush this just to show how my workflow looks like. I might skip some phases.
Alright, so I will go with the following, very basic drawing, that everyone can make in 2 minutes (15 for me, because I can't draw).
And of course the shadow is wrong, sheesh. Anyway...
The first thing I always do with my drawing is just copy it over to SD and regenerate it with different settings, seeing how it looks and whether it gets anything new. I got my base composition, which I want to see on the image; at this point I don't care if there will be some new stuff, like mountains or trees (the drawing kind of gives it away, I assume SD will just shove some mountains in).
If you remember from my Positive Prompting section, I use three separate sections, which you can see if you break it down visually: <enhancers>, <subject>, <loras>. The Negative Prompt will always be the same, I don't see any point in changing this. Also, remember, I change embeddings/loras names.
// positive
zrpgstyle, detailed background, hi res, absurd res, ambient occlusion, (stylized:1.1), masterpiece, best quality,
perspective scenery of an epic landscape in the dusk, dusky sky, big flowing river with rocks, foamy waters, two grass lands, big tall watchtower with bridge in the water, dusky sunlight
,<lora:AddDetail:1>
// negative
monochrome, simple background, NG_DeepNegative
I really tend to think all those enhancers and """prompt engineering""" techniques are completely useless when inpainting and it only comes down to how accurate your source image is. Well, my input image is bad, so I don't expect much. I will leave those tokens until I find more time to experiment with this, but I'm 90% sure they are not needed.
Now, here's the thing. I won't blindly just throw it into img2img with max denoise and see what happens. No. The point is to create something from your own drawing, which should be at least 90% of the composition. You will change this while you work anyway, but it's way more satisfying if you actually get what you want.
Since I switched to the new PC, I don't have to use Ultimate SD Upscale anymore, and I just generate straight up at twice the initial size (1920x1152). I used to upscale through the extension to get something new, but I noticed I get way better results on a regular reiteration. At this point, sure, I'm using 1 Denoise, but there's the twist: with CNet Tile Resample you can safely regenerate your image without losing your composition. It's a really, really powerful model.
So, I do the initial generation on many more steps than needed, 46, but I'm not looking for a quality image, just a higher amount of inserted objects without changing the composition.
I'm going with the following settings:
Steps: 46, Sampler: DDIM, CFG scale: 5, Seed: 750522234, Size: 1920x1152, Model: aZovya, Denoising strength: 1, ENSD: 31337
46 steps, because DDIM sampler does some rounding and on 45 it will do 44, so I have to force 46 to get 45. Major image changes can sometimes occur every 5 steps, so it's just my preferred way 🤔
You can as well just do 20 on DPM++ 2M SDE Karras if you have a long negative prompt, which will yield a slightly different result, but it won't be better. They will be similar most of the time, but that heavily depends on the model and how long your negative prompt is.
Since I always have ControlNet on, for this initial image I had to do 0.5 weight on balanced mode
I always use random seed, copying it here from the output just in case.
Now, you might think this is completely fine, looks nice, the artwork is done. Nope ❌ This is like 5% done.
After seeing this image, I can adjust my base drawing, because I see some nice improvements, so this will now serve as a reference image I can look at while applying corrections to the base. I like how I got lucky with the sky and my bad shadow got turned into a forest, but I don't like the colors on the tower. Back to Photoshop.
While looking at the new reference, I adjusted a couple of things along with the sunlight and applied a Raw Filter from Photoshop - Red Lift Matte from the new presets.
And this is the result on the same settings. I drew a small ramp, but I had some problems with generating it with the lowered CNet weight (I had to stay on the same settings), so I will just inpaint it over the image later. No, prompting for stairs does not work, it makes stairs wherever it wants and I'm not going to waste my time on Latent Couple just for that, which will also not work with Tile Resample in this case anyway. I lost my rock, but I kinda like what SD gave me more - I might change it later though. I also don't think I like the top layer of the bridge I drew earlier, so I don't really care what SD makes there - I will be happy even if it turns out completely flat.
Just for the sake of this document, I will switch to some random realistic model while working on the image: Photon, because everyone uses realistic models ¯\_(ツ)_/¯
Also notice I won't be using an Inpainting version of a checkpoint (this model doesn't have it anyway), because despite having to "create" objects, I work on sketched areas, so I can just use a regular checkpoint. This is just easier and faster.
Changing the model won't affect my style that much in this case due to the specific level of detail present on the image, and with ControlNet in addition, I don't have to prompt for stylized anymore. CNet will somewhat help me retain the current style, so I will now safely go with very simple prompting. The model creator says not to use negative embeddings, but I don't see any bigger changes with my usual negatives:
// \/ SPACE HERE!!!
photo of
my_subject_here
,<lora:AddDetail:1>
The prompting will be super simple, so I will just type whatever I'm inpainting in there. No enhancers or any other crap.
As for the settings, use whatever you want. When inpainting, I'm using 21 Steps / DDIM / 5 CFG, 1 Denoise, CNet weight: 1 with Color Correction on and the ScreenBooster V2 upscaler (which I will be changing later). The CNet weight and mode will constantly change along with the Denoise. I could lower the steps even further to just rush it a bit more, but I kinda want to keep the slightly better quality (21 steps because of the rounding on DDIM - sometimes you end up with -1 steps, and the composition can change every 5 steps).
This is the main part of the workflow, when I start doing the real thing.
I will start off by adding the missing parts: the stairs and the rock (I have to see if the rock will look better than the thing SD created). In Photoshop, I just draw over the wall to make something that resembles a staircase or a ramp. I could also just pray SD will make something from Fill or Latents, but why waste time on something that won't work 100% of the time. It's easier to draw and inpaint on Original. Also, because I'm resampling a rough drawing, I have to switch my ControlNet Mode to My prompt is more important, because on Balanced it will pay more attention to both CNet and the Prompt, so it just won't work unless your drawing is really good or you photobash something in.
Now here's a super important lesson: you don't need Only masked padding from the settings, you can simply place a tiny dot to extend the context of the processed image without actually changing the mask's shape. You will be working on 4px mask blur anyway, so the dots will disappear and the dynamically created padding will stay. This is way more convenient than playing with the settings. You can literally just tell SD where to gather the information from.
Before I get to the masking, here's how I work with Photoshop: I copy over a part of the image, which also lets me select a custom context, but I won't necessarily use all of it, because it depends on what I inpaint. Here, I might not need it, but ahead of time I'm selecting a bigger area in case SD will need the context. Now, what does it look like in Photoshop (look at the Outline)?
Note: the reason I'm doing this is because the webui is super slow when working with a full resolution image, and when you work inside Inpaint + Canvas Zoom, you lose the ability to dynamically calculate the target resolution; on top of that, every generated image will take too long to finally render. Even on an i9-13900 + RTX 4090 it can take 2-3 seconds to render a 1080p+ image, I can't even imagine how slow this will be on older machines with even bigger images...
Anyway, back to the context:
Make sure you do not move your selection no matter what! When you paste an image back to Photoshop, it will automatically be placed at the same spot - basically where your selection is - and since the images are identical in size, it's even more efficient, because you just paste the image in, apply corrections if needed, then merge layers.
About the dynamic context:
Now pay attention to the corners. In this case, I needed a bit more context, because I didn't get exactly what I wanted. I placed tiny dots at the corners of the pasted image. The downside of this is that my mask won't be fully upscaled to the target size, but guess what: I can now take advantage of automatic target size scaling. I don't have to worry about calculating the target size as I would if I were using the entire image as input in Canvas Zoom.
You might be asking: why are you not using Whole Picture then, if you are extending the mask to the max anyway?
That's because Whole Picture is not downscaling the generated image back, but upscaling. This is useless, because I would have to manually scale the generated image to fit, which is completely stupid.
Rather than manually setting the target size, I can just use the slider for a concrete scale value.
You have to click on the input box (or outside) every time in order to update the size information below the slider when you paste a new image or update the scale manually by typing. It won't automatically update every time.
With this, it's easier to control the size with regard to your machine's capabilities, because you can just see the calculated size and you will have an easier time setting the specific scale value.
After pasting this back into Photoshop, there's not much to do with it, because this is just the first inpaint and it will be resampled later, so all I have to do is merge those layers and move on to the rock (select both layers and ctrl+e to merge).
Now, the rock will be a different story, because I need more context since it's in the water. At this point I don't really care about changing the water, I just want to put a rock in there as I sketched in the beginning.
This will be my entire context of the image, but only the rock will be masked with some lines below it.
From now on I will be pasting the entire context copied from Photoshop here (the image inside the selection from above, as an example).
So, the size of this particular area is 382x275px, so I can safely resize it 4 times, which upscales the image to 1528x1100px - which I'm fine with on 20 steps, because it takes 7 seconds with CNet enabled.
Prompting for massive submerged mossy rock under the flowing river, water reflection
I didn't like the results of the image, so I had to change the CNet weight to 1 and set Denoising to 0.75 (this mode requires a lower denoise; Balanced doesn't care, you can go as high as 1).
Again, I don't care how this looks right now, because I will be regenerating this area. I only did this to change the composition and to show you an example of how I solve this in my usual workflow.
Let's bring the entire image back again for a moment. I want you to look at the image and imagine all the areas that should be inpainted - how the mask shapes will look initially. To make it easier, split the image visually into separate main subjects - sky, mountains, watchtower (is it called a watchtower(?)), bridge, river, plains, etc.
The workflow here is doing the first pass on the bigger areas, as long as they fit into a single mask and can be upscaled at least twice - you might need to improvise if something is just too big and this will happen a lot.
In this example, if you wanted to inpaint the sky, you can't just mask the entire 1920x460px top part. This is where thinking outside the box comes into play. You will sometimes have to split your mask either into two images and overlap + blend them, or mask to some visible edge where it won't make a difference when joining two images, but that requires some careful selection.
The important thing now is that you have to mask things in the correct order, because later on it will be annoying to fix the remaining areas. This mostly applies when inpainting something on clear backgrounds, which can lead to a glowing or darker outline, so in this case you inpaint the background first, slightly overshooting over another subject so it can blend into the background. This might not work the opposite way. So, what's the plan?
Of course, you could select the said area in its entirety, mask it and pray for a good result. While on a beefy machine this is still fine, you might get into some trouble if you are going for high resolution, because you will end up masking a massive area, where the upscale has no effect at all, but you might get lucky and get something usable, mostly with sky backgrounds. It might be difficult with things like Forest or Buildings.
While this size (the said 1920x460px) looks huge - and in reality, after scaling it 1.5 times to 2880x693, you might be thinking it's massive - it actually can be comparable in speed to generating around a 1472x1472px image (or whatever is closer and selectable in the webui). Both are still around 2 Megapixels.
I don't know the exact math behind this, but a rough estimate of the square equivalent seems to be just the square root of the total pixel count: sqrt(width x height), which here gives sqrt(2880 x 693) ≈ 1413px.
This means that if your machine can somewhat handle ~1400px images, then you are good to go with further upscaling such images. You might be in a bad situation if you own a GTX 1070 or 1080 - this can take up to 2 minutes with CNet enabled, depending on your settings.
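A quick check of that rough square-equivalent estimate (just the square root of the pixel count; the actual speed relation will obviously differ per GPU and settings):

```python
import math

def square_equivalent(width, height):
    # Side length of a square image with the same total pixel count
    return round(math.sqrt(width * height))

print(square_equivalent(2880, 693))  # ~1413, in the same ballpark as a 1472x1472 image
```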
Now, let's see how that would look. Since we are dealing with the Sky, there is a good chance it will just turn out well, because there are no complex patterns/shapes, and I guess models were trained on high quality Sky pictures.
I masked this quickly, overshooting over the tower (deliberately) and mountains.
Now, the trick here is to overshoot with the mask in a way, that you mask the sky and a tiny bit of the tower edges, which will result in a finer blend if your settings are correct. SD might sometimes improvise by extending the tower, but that's what Tile Resample is for. SD should generate the edges properly. If you leave unmasked areas between the subjects, you might have visibly untouched spots, which will just look bad and might become unfixable on further passes (basically what you would get from a very bad photobash). There is a problem though - if the settings are incorrect the additional subject might end up getting either duplicated or enlarged.
Now, the trick here is to prompt for [foreground] against [background], which might help a bit - something like that, or just this: tower behind a mountain sky dusk horizon - literally. You might need to experiment with the CNet mode and Denoise here, ranging from 0.75 to 1 now. The mountains will be different, but I don't care about that for now, I just need to test how the sky looks, as this is the highest priority now.
It doesn't look that bad, but it might be better. I could upscale it three times, but sky doesn't need super detailing, so I personally would ignore this. I can't see anything wrong with this on raw scale anyway, except some visible noise in the area illuminated by the sun (the entire middle part). Once you see enough AI-generated images on the Internet, your eyes will be trained enough to immediately spot super bad quality and AI-specific artifacts in order to eliminate them.
The tower itself will be fixed later. Now let's get to the mountains.
Always think of this workflow as working from the far background towards the nearest foreground - trust me, it saves some unnecessary fixes.
There is a problem, though. It extended the tower to the right. At this point I'm not sure if that was a wide mask + CNet bug or the weight was too low:
It also got extended above the balcony, but this part is fine, because there is just sky in the background, so it will be an easy resample when I get to inpainting the tower. The bottom has to be fixed, because it just looks like a part of it is sticking out.
We need to fix that before proceeding. This should be an easy fix. What I normally do is grab the Clone Stamp Tool in Photoshop, sample the background nearby from a matching spot (alt + left click), then brush it onto a new layer where the new tower edge was generated, erase the overlaid sky paint from the tower and blend the edge into the sky, so it looks more natural.
The edit is visible, which you can clearly see from the darker mismatching blend, but it will disappear on later inpaints if I decide to fix the sky.
Actually, it's time to do the mountains, and since I'm on the right side already, I will fix it along with the sky. I will grab a bigger image to give SD more context about what's on the other side of the tower, so it can make a bit better sky. I will mask the entire right side along with the mountains, marking the dot-context on the other side of the tower (left). I'm just not sure, and since working on something in SD is just constant experimentation, I just have to see how things go. I will mask the foreground mountains on the right side (brown color), so it can blend them into the sky.
Prompting with: rocky mountains against dusky sky horizon with 0.9 denoise and 0.25 CNet weight, upscaled (2.45x).
Tweaked upscale, because SD loves crashing on oddly sized images, so you might occasionally get this error, which tells absolutely nothing:
ValueError: height and width must be > 0
There's some remainder weirdness in the mask resizing, so just tweak the scale, it doesn't matter that much anyway.
Lowered the CNet weight, because it's usually better with things like Sky, Ocean, but that depends on the style.
This didn't turn out as well as I wanted, the sky doesn't seem to match, but judging by the left side of the entire image, it looks fine, so I'm keeping that. I also had to mask mountains, because it's gonna be easier to work with already fixed background.
Also note that I'm constantly merging new layers, I'm just not mentioning it, because it should be quite obvious to you. Sometimes I will also skip the image showing the mask shape, so you can visualize yourself how things should be masked when it's obvious.
Now it should be a good time to move to the mountains...
Now the image along with the mask. Here, I masked the mountains, overshooting just a bit into the sky and extending the context to the left, so SD has more info about the colors. I will use the same prompt, but with 0.5 CNet weight.
Be extremely careful when doing a very wide image mask like this. SD has some trouble with weird aspect ratios, so it might not put the generated image back into place correctly and just leave a blurry mismatch of the two images! It's better to stick to square ratios if possible. Sometimes a webui restart helps; I still haven't pinpointed the cause of this issue.
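One way to keep the ratio close to square without eyeballing it is to grow the mask's bounding box into a square context crop before inpainting. This is only a hedged sketch of that idea using Pillow - it is not an A1111 feature, and the padding value is arbitrary.

```python
# Hedged sketch (not an A1111 feature): grow the mask's bounding box into a square
# context crop, clamped to the image, so a very wide mask doesn't force a weird
# aspect ratio.

from PIL import Image

def square_context_box(mask: Image.Image, pad: int = 64):
    """Return a square (left, top, right, bottom) box around the mask's painted area."""
    bbox = mask.getbbox()  # bounding box of non-zero (painted) pixels
    if bbox is None:
        raise ValueError("mask is empty")
    left, top, right, bottom = bbox
    side = max(right - left, bottom - top) + 2 * pad
    w, h = mask.size
    side = min(side, w, h)                              # the box can't exceed the image
    cx, cy = (left + right) // 2, (top + bottom) // 2
    box_left = min(max(cx - side // 2, 0), w - side)    # clamp so the box stays inside
    box_top = min(max(cy - side // 2, 0), h - side)
    return (box_left, box_top, box_left + side, box_top + side)

# usage (hypothetical files): mask = Image.open("mask.png").convert("L")
# crop = Image.open("image.png").crop(square_context_box(mask))
```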
This is the result. I like how it put half of the mountains behind a slight fog. I can work with that.
Now time to test another wide shot, masking mountains only:
Here, I had to actually mask not only the mountains but the entire area (except the water) and change the prompt to make cliffs instead, just for a test: `rocky cliff lands against dusky sky horizon, tall rock formations`, with 0.25 CNet weight.
This is the result. Not quite what I was expecting, but I'm loving it so far. There is another issue, though.
I forgot I masked everything right at the edge of the image. This is a mistake that requires manual corrections later:
Not that this is a problem, though. It's a tiny fix with an eraser tool on minimum hardness. It doesn't have to be accurate, I can just resample it again later:
It's time to do the remaining part of the image. On the same prompt and settings, I will mask the entire mountain, along with the far background and left green side of the image, overshooting over the small, already generated portion to blend it.
Looks really nice for the first pass, but the lower part of the generated image looks bad - the lines are unnaturally curvy/squiggly. To other people this might already look OK, but to me it's completely unacceptable for an artwork. I will attempt to fix that later.
The rest of the workflow does not really change: it's just masking big areas and experimenting with the settings. There are no perfect settings for every mask, you just go with whatever gives you good results.
I gave this area a second pass just to see how it turns out (mountains and the sky only): https://imgsli.com/MTkzNDUx - It looks fine, but it's way too sharp, so I will fix that later.
First, the road. It might be a little harder since it extends throughout the background and foreground, but I think it should not be a problem. The entire untouched land is ~768x768px, so I should be fine with a double upscale on this.
I will only mask the top layer of the land, ignoring the cliff side for now as it's a different subject.
Comparison of the inpaint: https://imgsli.com/MTkzOTA2
I had to give it multiple passes:
- inpaint the entire area
- inpaint the "road"
- inpaint lands separately (left, then right)
- inpaint the road again
I really dislike how noisy the realistic models are, really not my style, but what can you do. I will have to add some manual corrections later.
Also, at this point I had to switch to `Sampler: DPM++ 2M SDE Karras / Steps: 30` to make use of the longer negative embeddings, because I'm working with a background that could use some more enhancement and isn't really good in this case. Landscapes like this can be a little problematic, especially when there's grass/trees or foliage in general.
By the way, I only use two samplers. I added `DPM++ 2M Karras` for a while just to compare it to SDE, but it doesn't seem to be that much different.
Now, the hardest part: water. I will probably have to mask the entire part, because the bottom is more detailed, so I hope this will even it out a little.
So, this is going to be the entire context - the entire water part will be masked. It's pretty big when upscaled: 2112x1588px... I also wonder what will happen to the rock.
It will be a little harder, because the bridge is in the way, so some details might get destroyed.
and oh wow, OK, I did not expect this! https://imgsli.com/MTkzOTEw. This is perfect for the first pass, so I can just immediately move on.
I thought about starting from the center, but I changed my mind and wanted to go back to the top-right - the tower just bothers me and it's annoying. I will also overlay a bit more orange color and see what it does:
I have a feeling the new sky is terribly inaccurate, but hey, I'm not an artist ¯\_(ツ)_/¯
After brushing some fog in and re-generating the land, I had to step in, draw some hard spots and turn them into trees, because I didn't like how the land looked: https://imgsli.com/MTkzOTEx
Moving on. I will now work on this entire area. I still don't like how noisy it got on the top side.
I will mask this area in a couple passes:
- top layers
- sides
- the road
Here's the comparison between the initial image and the inpainted area: https://imgsli.com/MTkzOTIx
I changed the colors from brown to gray, because I felt they fit more as rocky walls. The grass will be a bit hard to fix, but maybe I can come up with something later.
Now, the last two steps are to just fully mask the bridge and the tower separately and move on to the next phase, as there is not much to do right now.
I will just go with this:
Not much has changed, but now it will be a bit easier to fix things. There's a lot of manual work to do here...
After the first big Inpainting session and color correction, it's time to compare the results to the initial generation. Not much was done with the colors to be honest, and the second phase will require lots of lighting fixes... the filter is way too strong, especially on the mountains, but the second inpaint will even things out - hopefully.
- comparison: https://imgsli.com/MTk0MDkw
The only thing I'm not going to touch on this image is the water, because it's already "perfect" and there's not much I can do with it apart from adding a color overlay, so this is the only part that does not require further inpainting. Maybe the rock could be adjusted a little, it looks a bit noisy.
The first thing to do now is to look at the image up close and list all the things that could use some adjustments, but before I do that, I will need a little helper image with the sunlight direction (how accurate it is, I don't really care - I can correct the lights to some degree at least).
A couple things are wrong with this picture:
- the sky on the right side of the tower doesn't look like it fits what's on the left side - it's a bit too dark and too detailed, but judging by the far-left side of the image, I think it's almost fine, so maybe just a little brightness adjustment should be okay. The red tint doesn't look right, though, but maybe it's just me
- the fog on the right side is too dense - it could probably use some blending, or more fog should be added to cover a bigger area
- the trees on the right side have to be re-generated; while it seemed obvious, I will probably need to redo this part from scratch, because the scale of the trees doesn't seem right
- the grass with bushes on the right side is way too noisy (also on the left side), this part will need some improvisation with hard sketching
- all roads have to be a bit more defined, the grass is "too blended" in some places
- there are no shadows cast behind the bridge on the land part
- the staircase has to be remade from scratch - judging by the scale, the steps are way too tall
- the tower needs manual adjustments to the entrance, the bush has to be removed, new window has to be drawn, the shadow on the right side has to be more defined, the balcony is "too thin" and its floor has to be smooth from the bottom, all lines should possibly be straight and all shapes have to be corrected
- the top part of the bridge wall either has to be smoothed out or the decorations have to be more defined = they kind of just look weird
- the entire horizon has too much sharpness and needs an orange/red tint
- the land on the left side is too bland, might need some re-sketching
- mountains on the left have too much sharpness and there is a visible edge at the bottom part of the mountains
There probably is more stuff, but I guess I will catch the rest on the second Quality Inspection pass.
At this point it doesn't matter where I start inpainting, so I will pick at random and take the hardest part: the bridge. It requires some manual fixes...
After this manual adjustment with the `Dodge Tool`, I can give it a couple passes. I had to overexpose some parts, because I think SD will have a better time re-generating what I want, except the weird shapes on the wall might give me some trouble, but we will see.
- entire object
- top layer
- the bases
- arches
- upper part of the wall
For now I think it's good enough, the arches probably will require some more shadows later and the highlight is kind of gone, but maybe it looks more natural now. I think it's a bit too sharp. Also, not sure about the top layer with the bushes, kind of doesn't fit, but whatever.
Off to the tower now. Here, I will try a different approach with drawing instead of color dodging. Lots of fixes are needed...
(comparing to the initial image, because only the tower matters)
Well, it would be easier if I could draw, but that's the best I can do right now while trying to not spend much time on it, so I hope SD can make something good out of it:
Not much has changed, but this should do for now. The blur will later just be replaced with the Clone Stamp tool. At least both the window and the entrance look a bit better. The bottom-right part of the tower looks kind of weird, but I will come back to that later. The top is different, but I think it's fine. I'm starting to see why the scale is messed up - just compare the trees to it...
There is something I really need to try by messing around with the Smudge tool. I just changed all the smaller mountains and I will try transforming them into some rock formations - lowering the CNet weight to ~0.4. I will have to use the Detail Tweaker LoRA at 0.5 strength for this one only.
and the horizon:
But you know what? Something doesn't feel right. In fact, I don't actually like the mountains on the left... at this point I'm a little annoyed, because the more I look at the left side, the more I want to change it, and I think I was right. I gave that entire area a couple passes, regenerating almost everything. This is also a moment where I just experiment with the CNet `Mode` and weight. I often do `Balanced, 0.4 weight` or `Prompt importance, 0.9 weight` combos if something doesn't want to work. There are some visible edges with poor blending, but I think I can live with that. I will probably adjust this later.
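If you wanted to automate this kind of weight/mode experimentation instead of clicking through the UI, it could look roughly like the sketch below. The `/sdapi/v1/img2img` endpoint does exist in A1111, but the ControlNet `alwayson_scripts` field names are my assumptions from memory and differ between extension versions, and the file names are placeholders - treat this as pseudocode, not a working recipe.

```python
# Hedged sketch only: sweeping "Balanced 0.4" / "Prompt importance 0.9" style combos
# through the A1111 web API plus the ControlNet extension.

import base64
import requests

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

URL = "http://127.0.0.1:7860/sdapi/v1/img2img"  # assumed local A1111 instance

combos = [
    ("Balanced", 0.4),
    ("My prompt is more important", 0.9),  # what I call "Prompt importance" above
]

for mode, weight in combos:
    payload = {
        "prompt": "rocky cliff lands against dusky sky horizon",
        "init_images": [b64("crop.png")],   # the masked crop (placeholder file names)
        "mask": b64("mask.png"),
        "denoising_strength": 0.9,
        "alwayson_scripts": {
            "controlnet": {
                "args": [{
                    "input_image": b64("crop.png"),
                    "weight": weight,
                    "control_mode": mode,
                }]
            }
        },
    }
    r = requests.post(URL, json=payload, timeout=600)
    print(mode, weight, r.status_code)
```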
Nothing new will happen now, so I will just re-inpaint the remaining areas, because the process will literally be the same - you basically know everything at this point, and I can skip to the end of the second Inpaint pass. It also turned out I shot myself in the foot with this one, trying to get the scale right with this perspective; it just didn't work as I expected. That was not the point anyway, I just wanted to share how unnecessarily complex my workflow is and what you can do with it.
Also, I was really, really tempted to re-inpaint the river just to see what happens, and I totally missed the fact that the CNet `Mode` might be important here, so I started switching between `Balanced` and `Prompt Importance` a little more with random weights, and it turns out it matters even more.
Got some nice waters now (and a new rock):
After some more inpainting, the second pass is done and it's time for a comparison with the previous stage - not everything was inpainted this time. I will add the full comparison after the image passes the second Quality Inspection. Some further transformation was required, mainly on the tower and the sky. The tower was a bit of a failure, because it doesn't fit the bridge, but hey, the overall composition is still there. You will notice only two untouched spots - those on the right I actually liked. I thought the water was already perfect, but I decided to give it another go with some different CNet settings (again, just weight and mode experimentation). The only thing I can't fix is the horizon: even with a hard sketch of the lighting, SD just couldn't mirror the light properly, and I don't really know how to mirror the reflection under perspective (on a flat line it's easier).
- comparison after some last minute tweaks: https://imgsli.com/MTk0ODA5
First, I will go back to the first Quality Inspection and make sure I didn't miss anything or just gave up on any of the listed issues:
- the entire sky has been re-generated - it's a bit less dark now, the highlight is a bit more subtle
- the right side is less foggy, it had to be re-generated completely. It was unfixable
- the trees look more like trees now (wow)
- all lands covered with grass have been re-generated, it's way less noisy. I got some flowers now
- the roads have been fixed except for the one going under the bridge, I think this one is fine as it is
- there is a slight shadow behind the bridge and the tower - still not sure how long it should be and how dark...
- after taking a closer look at the overall scale of the staircase, I still wasn't sure if I actually had to redraw the steps, so I just re-generated them to get the correct level of detail and lighting
- the tower has been completely transformed again
- the bridge has been re-generated in a way to make the decoration look more natural
- new horizon, less sharpness
- the left side has been completely changed
With that out of the way, time for another quality inspection - less harsh this time, because this is getting way too long for just an example workflow showcase. There are still some minor issues, though. I will fix them immediately when spotted rather than listing them and going over the list - this should be much easier, I think.
You can still look at the comparison image (right side) and check where I found more issues:
- the base of the bridge had a weird blend of probably two other images, which was missed at some point
- the horizon was missing the vibrance boost after re-generating
- the left side was missing some vibrance boost from the sunlight (I do this with the Sponge Tool, with Vibrance ticked or not, depending on which color space I need to boost)
- the tower had incorrect sky blend on the left side (near the mountains)
- there was a minor issue in the tower structure on the right side (an incomplete column) - the entire upper part was also re-generated
- the entrance needed some detail enhancement, the decorations were just a mushy mess of squiggly lines (I HATE seeing these obvious artifacts)
- the window also needed some more details, it was too noisy
- the rock needed a little highlight
- the water was too dark, added a little vibrance boost (probably a bit too much)
- saturation was missing on the tower window and the entrance
Comparison of a Before and After the Quality Inspection: https://imgsli.com/MTk0ODM2
I left out some minor bland color spots, like on the right side, but I don't think there is anything more I can do here except for some small shadow adjustments.
There are also many more places where the image could be improved, but I decided to leave it as it is and just look for small spots that could use a tiny bit of highlighting and shadowing - and since most of them were taken care of in earlier stages, there was almost nothing to adjust and nothing worth sharing as a comparison.
This is the final step of my workflow and it's super important to get this right - I mostly don't, because I don't have a good eye for colors. You probably know this one from Adobe Lightroom.
Before I apply the filter, I really need to crop the image first. I usually do this earlier, but I couldn't decide if I wanted to crop the top or the bottom. At this point I will just crop the bottom to 1080 pixels, because I like the top of the tower, so `Ctrl+Alt+C` and place the anchor at top-middle:
Now back to the filter - I cropped first, because I apply Vignetting through the filters, and if I did that before cropping, the filter would be ruined (the bottom edge would be cut off).
The first thing I do with the filter is look through the new presets and check which ones look good or fit the overall mood of the image.
I usually check these presets first:
- Vivid
- Turquoise & Red
- Warm Contrast
- Warm Shadows
The first two go really well with fantasy artworks, where you need a good color boost - at least with most of them. I mostly swap between `Warm Contrast` and `Warm Shadows` and play with the other settings.
I avoid any detail/sharpness-related filters, because they just destroy the image. People really love to over-sharpen their images and it looks really bad = high sharpness does not mean super high detail.
I will probably mess this up, because I use different monitor settings, so it won't look good for everyone else.
After applying the `Warm Shadows` preset, it's time to play a little more with the settings. The `Basic` tab gives you the most control, so it's best to experiment with it a lot on your images. This is mostly just preference, so I go with whatever looks "ok" to me.
I rarely do this, but I needed some `HSL` adjustments - probably because of the higher `Saturation` in the `Basic` tab, I'm not sure. I also needed a higher `Midpoint` for `Vignetting`, because it was just a bit too dark.
In general, the filter was required to fix the temperature of this image. It didn't make any sense for the image to have such a high temperature - it just needed that ~-2000K shift.
Note: these settings apply specifically to this image only, you will have to use different settings on every image!
And finally, after all this unnecessary work and probably 6 hours in total, the image - made completely from scratch - is ready.
I can already tell you that this image is still not done, because this is just a rushed workflow showcase, not a guide. I don't intend to write guides for this, because new ways/tools are introduced into the AI world every week/month, so it's useless to write/record guides. All the stuff I use will be outdated really quickly, and I'm ~80% sure my workflow was already outdated, because I haven't really changed anything in the past 4 months, but I don't really care.
There are still spots that need improvements/fixing. Can you spot them now? There are highlighting errors, incorrect shadows (direction), blending errors, poor quality, glow. Probably more, but whatever. There is also a glitch on the bridge, which I didn't even notice. I could do a third Inpainting Pass on all the smaller subjects, but this is good enough for an example image.
Now it's time for some comparisons:
- Finished image, no filter vs Camera Raw Filter applied: https://imgsli.com/MTk1MjMz
- Finished image vs initial drawing: https://imgsli.com/MTk1MjQ1
Yes, the composition changed a bit. I didn't really care what happened on the horizon, because I had no idea what to do there while drawing, and the right side made more sense after some re-generation. The idea still remained the same, so that's a plus.
I didn't really like how the mountains on the left turned out after the first generation, so I decided to just change the composition. Think of it as a client telling you they don't like the result and want something else instead.
This is most likely the point where you say: all this is useless, the image is small, all this work to get some mediocre-sized image, which I could just generate, upscale with scripts, throw into some MultiDiffusion and be done in 5 minutes, ending up with a 4000-pixel wide/tall, sharp and detailed image!
Well, that's all good, but I don't really care. I enjoy spending a couple of hours on something of actually good quality at both small and raw scale.
Take a look at this for example. I spent 5 minutes in total on some image:
- `txt2img` with a random prompt from Lexica
- high res. fix 2x upscale
- `Ultimate SD Upscale`, 3x resize unprompted on 0.1 denoise
- adjust colors with `Camera Raw Filter` in Photoshop
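For reference, the first step of that quick route could look roughly like this through the API. This is a hedged sketch: the `/sdapi/v1/txt2img` field names are how I remember them and may differ between versions, the prompt and values are placeholders, and the Ultimate SD Upscale and Camera Raw steps happen in the UI / Photoshop, so they are not shown.

```python
# Hedged sketch: txt2img with the high res. fix at 2x via the A1111 API.

import requests

payload = {
    "prompt": "random prompt grabbed from Lexica",  # placeholder prompt
    "steps": 30,
    "width": 512,
    "height": 768,
    "enable_hr": True,           # high res. fix
    "hr_scale": 2,               # 2x upscale
    "denoising_strength": 0.5,   # hires-fix denoise (example value, not a recommendation)
}

r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
print(r.status_code)  # the base64 image(s) come back in r.json()["images"]
```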
Deliberately posting a small size
Looks awesome, right? It's a `3072x4608px` image downscaled back to ~768px. Well, for probably 99% of people this would be a very beautiful, highly detailed masterpiece (heh), but for my needs, it is completely useless. Maybe that was a bit too extreme, so I will just downscale by 2:
This is the scale, that would be useful to me, except there are too many problems with this image:
- obviously, hands
- right thigh (left side)
- inconsistent material
- texture of the surfaces - you just can't tell whether some parts are skin, metal, glass, leather...
- all glowing parts need lighting and shape adjustments
The one issue I will ALWAYS have with just-generated images is that most of the details come out as this weird, unnatural and chaotic (or maybe abstract) texture, which somewhat resembles liquid metal(?) mixed with ugly blur and occasionally some sharp lines. I sometimes refer to this as Plasticine - it literally looks like that, so maybe that's a more appropriate comparison.
Here's a raw scale example, which makes my point easier to understand. Just look at any surface, and you will immediately see what I mean.
I see this all over the place, and it just triggers something in me when I see a nice image in a thumbnail, preview it, and it's just an absurdly upscaled image where almost none of the details make sense. Since I have full freedom over what's in the image and what can be done with it, I can at least put some more effort into making it look a little better.
In the image below, the first part is the raw-scale generated image. The second and third are inpainted with different parameters (only the inner parts of the thigh).
Last example, just one pass of Inpainting: https://imgsli.com/MTk2MzQ5
This part would first require manual shape adjustments all over the place, because you can just tell how off things look; only then would actual inpainting give more accurate results, because right now the shapes are just too chaotic. The first inpaint pass just removes the AI noise, the second inpaint pass would properly resample the surfaces with more detail, and after that the dark filters with color adjustments would make it look good. That is quite some work, but it's definitely worth it.
But, you know what's an instant fix to this problem? Artistic styles and a -2 weight `Detail Tweaker LoRA` :) yes, this just screams anime (just an example).
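In case the LoRA weights mentioned in this post are unfamiliar, here is a small, hedged example of A1111's `<lora:filename:weight>` prompt syntax; `add_detail` is an assumed file name for the Detail Tweaker LoRA, so substitute whatever your copy is actually called.

```python
# Hedged example of A1111 LoRA prompt syntax for the strengths mentioned above.
# "add_detail" is an assumed file name - use the name of your own LoRA file.

detail_boost = "castle tower on a cliff, dusk <lora:add_detail:0.5>"    # +0.5 strength
detail_removal = "castle tower on a cliff, dusk <lora:add_detail:-2>"   # -2 flattens detail
```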
Don't get me wrong, I'm not straight up saying that other people make garbage-quality images, that their images are bad, etc. What I'm saying is that I personally enjoy good quality in my own results, which is why I'm super critical about some parts of my images. I don't judge the quality of the content other people make. Sure, it may sometimes trigger me when someone says an image is super detailed and sharp and it looks like what I described somewhere earlier. Everyone sees things differently, so, in the end, it doesn't matter.
My images are no better - there are lots of mistakes, mostly when it comes to the colors and lighting, but I don't really care. I enjoy doing this, even if it takes 100 times more effort than just prompting and upscaling. When I have the opportunity to make something with a higher level of detail and eventually make use of it, it's awesome.
I don't know if I already said this, but the point of all this was not to teach you to do things the way I do, or to tell you that you have to do things my way because it's the best. There is no best workflow; people just have their preferred way of working and nothing will change that. My workflow is not even optimal for production. I'm not an artist, so I just assume something looks right when it mostly does not.
If you actually were insane enough to read the entire thing and managed to endure my biased takes, thank you for your time, and I hope you learned at least something from this 160k-character-long rambling.
If you want to know more about what stuff I make with this workflow, you can check out my media shared on Twitter. The only things I don't post there are game dev assets, as I can't share those, except when I start doing random things I won't use.
Happy Inpainting :)