Trying Out IP-Adapter

I tried out IP-Adapter, which lets you use an image as a prompt.

IP-Adapter

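First, prepare the image to use as the IP (Image Prompt). This time I use the famous dog by Nagasawa Rosetsu (from the folding screens 白象黒牛図屏風, "White Elephant and Black Bull").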

I seem to recall that IP-Adapter needs a square image (to be confirmed), so I made the image 512 x 512. This time I fixed it up with Outpainting in InDesign. (Details omitted.)
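If outpainting is not an option, a simpler way to get a square input is to center-crop and resize with Pillow. This is a swapped-in alternative, not what was done for this post: outpainting keeps the whole subject, while cropping throws away the edges. The helper names center_square_box and make_square are made up for illustration. As far as I know, the CLIP image processor on the IP-Adapter side center-crops non-square inputs anyway, which would explain why a square image is the safe choice.

```python
def center_square_box(width, height):
    """Return the (left, top, right, bottom) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def make_square(src_path, dst_path, size=512):
    """Center-crop an image to a square and resize it to size x size."""
    from PIL import Image  # imported here so the box helper works without Pillow

    with Image.open(src_path) as img:
        box = center_square_box(*img.size)
        square = img.crop(box).resize((size, size), Image.Resampling.LANCZOS)
        square.save(dst_path)


# An 800 x 600 landscape image keeps its middle 600 x 600 region.
print(center_square_box(800, 600))
```

Calling make_square("input.jpg", "the-rosetsu-dog.jpg") would then write the 512 x 512 file that main.py expects; input.jpg is a placeholder name.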

Now I generate an image with the following code, using this image and the text (a polar bear sitting in a chair) as the prompt.

The code is based on this page: https://huggingface.co/docs/diffusers/en/using-diffusers/ip_adapter

#
# main.py
#

from diffusers import AutoPipelineForText2Image
from PIL import Image
import torch

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipeline = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16).to("mps")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
pipeline.set_ip_adapter_scale(0.6)

image = Image.open("the-rosetsu-dog.jpg")

generator = torch.Generator(device="cpu").manual_seed(0)

images = pipeline(
    prompt="a polar bear sitting in a chair",
    ip_adapter_image=image,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
    num_inference_steps=100,
    generator=generator,
).images

# PNG is always lossless, so no quality options are needed.
images[0].save("polar-bear.png")

Next, set up an environment to actually run this main.py.

The OS is macOS Sequoia 15.2 (Mac Studio, M1 Max, 32GB), and the Python version is:

$ python --version
Python 3.12.9

After creating and activating a venv, upgrade pip:

$ pip install --upgrade pip

Then install the required libraries:

$ pip install diffusers transformers
$ pip install torch torchvision torchaudio
$ pip install accelerate
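Before running anything heavy, it can help to confirm that the installed PyTorch actually sees the GPU backend. The following is a minimal sketch; pick_device and detect_device are hypothetical helper names, while torch.cuda.is_available() and torch.backends.mps.is_available() are the PyTorch calls I am relying on.

```python
def pick_device(cuda_available, mps_available):
    """Choose a device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Ask PyTorch which backends are usable on this machine."""
    import torch  # imported lazily so pick_device stays usable without torch

    return pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
```

On an Apple Silicon Mac with a working install, print(detect_device()) should report mps. If it reports cpu instead, the .to("mps") call in main.py is unlikely to work.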

Place the Rosetsu dog image, the-rosetsu-dog.jpg, in the current directory, then run main.py:

$ python main.py

The run took a little under 5 minutes (267 seconds).
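To see where that time goes, the generation call can be wrapped in a small timer. This is only a sketch: stopwatch and seconds_per_step are made-up helpers, and the per-step figure assumes the whole 267 seconds was spent sampling, which overstates it if that number also includes model loading.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Measure wall-clock seconds for the wrapped block and store them in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def seconds_per_step(total_seconds, num_steps):
    """Average wall-clock time per denoising step."""
    return total_seconds / num_steps


timings = {}
with stopwatch("demo", timings):
    sum(range(1000))  # stand-in for the pipeline(...) call in main.py

# The 267-second run used num_inference_steps=100.
per_step = seconds_per_step(267, 100)
print(f"{per_step:.2f} s/step")
```

Wrapping from_pretrained and the pipeline(...) call in separate stopwatch blocks would show how much of the total is loading rather than sampling.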

A polar bear

Next I swap in the following text prompt and run it again.

a fluffy sheltie dog sitting in a chair

A fluffy sheltie dog
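Rather than editing main.py for every new prompt, the prompt could be taken from the command line. Here is a minimal sketch using argparse; the flag names --prompt, --output and --seed are my own invention, not something main.py currently supports.

```python
import argparse

DEFAULT_PROMPT = "a polar bear sitting in a chair"


def parse_args(argv=None):
    """Parse command-line options; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Generate an image with IP-Adapter.")
    parser.add_argument("--prompt", default=DEFAULT_PROMPT, help="text prompt")
    parser.add_argument("--output", default="output.png", help="output file name")
    parser.add_argument("--seed", type=int, default=0, help="random seed")
    return parser.parse_args(argv)


args = parse_args(["--prompt", "a fluffy sheltie dog sitting in a chair",
                   "--output", "sheltie.png"])
print(args.prompt, args.output, args.seed)
```

Inside main.py, args.prompt, args.output and args.seed would replace the hard-coded prompt, file name and manual_seed(0) value.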

requirements.txt

Finally, here is requirements.txt for the record.

accelerate==1.5.2
certifi==2025.1.31
charset-normalizer==3.4.1
diffusers==0.32.2
filelock==3.18.0
fsspec==2025.3.0
huggingface-hub==0.29.3
idna==3.10
importlib_metadata==8.6.1
Jinja2==3.1.6
MarkupSafe==3.0.2
mpmath==1.3.0
networkx==3.4.2
numpy==2.2.3
packaging==24.2
pillow==11.1.0
psutil==7.0.0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
safetensors==0.5.3
setuptools==76.0.0
sympy==1.13.1
tokenizers==0.21.1
torch==2.6.0
torchaudio==2.6.0
torchvision==0.21.0
tqdm==4.67.1
transformers==4.49.0
typing_extensions==4.12.2
urllib3==2.3.0
zipp==3.21.0
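When reusing these pins on another machine, it is handy to check which installed packages differ from the list. The sketch below only handles plain name==version lines like the ones above; parse_pins and find_mismatches are hypothetical helpers, built on the standard importlib.metadata module.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed in this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


sample = """\
diffusers==0.32.2
torch==2.6.0
transformers==4.49.0
"""
pins = parse_pins(sample)
print(pins)
```

Passing the contents of requirements.txt to parse_pins and the result to find_mismatches lists anything that drifted. A torch build installed from the PyTorch wheel index may report a local suffix such as +cu124 and therefore show up as a mismatch.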

Postscript: Ubuntu + CUDA

I also tried the combination of Ubuntu 24.04 LTS Server and CUDA. The GPU is an NVIDIA GeForce RTX 3060 (12GB).

With a venv prepared:

$ pip install --upgrade pip
$ pip install -r requirements.txt

Instead of running pip install -r requirements.txt, it might have been better to do the following:

$ pip install diffusers transformers
$ pip install torch torchvision torchaudio
$ pip install accelerate

#
# main_cuda.py
#

from diffusers import AutoPipelineForText2Image
from PIL import Image
import torch

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipeline = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")

pipeline.enable_sequential_cpu_offload()
pipeline.enable_vae_slicing()

pipeline.set_ip_adapter_scale(0.6)

image = Image.open("the-rosetsu-dog.jpg")

generator = torch.Generator(device="cpu").manual_seed(0)

images = pipeline(
    prompt="a polar bear sitting in a chair",
    ip_adapter_image=image,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
    num_inference_steps=100,
    generator=generator,
).images

# PNG is always lossless, so no quality options are needed.
images[0].save("polar-bear_cuda.png")

The diff between the macOS (mps) version of the code and the CUDA version:

10c10
< pipeline = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16).to("mps")
---
> pipeline = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
12a13,16
>
> pipeline.enable_sequential_cpu_offload()
> pipeline.enable_vae_slicing()
>

Apart from the output file name, all I did was change mps to cuda and add the following code.

pipeline.enable_sequential_cpu_offload()
pipeline.enable_vae_slicing()

Without these calls, generation failed with an out-of-memory error on the 12GB GPU.
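diffusers also offers enable_model_cpu_offload(), which moves whole components to the GPU on demand and is usually faster than sequential offloading at the cost of more GPU memory. Whether it fits in 12GB for this setup is untested here. The wrapper below is a sketch for switching between the two; apply_memory_savers and its offload argument are hypothetical, while the three pipeline methods it calls are the diffusers ones.

```python
def apply_memory_savers(pipeline, offload="sequential"):
    """Enable CPU offloading and VAE slicing on a diffusers pipeline.

    offload: "sequential" (what main_cuda.py uses), "model", or "none".
    """
    if offload == "sequential":
        # Streams submodules to the GPU one at a time: lowest memory, slowest.
        pipeline.enable_sequential_cpu_offload()
    elif offload == "model":
        # Moves whole components (text encoders, UNet, VAE) on demand.
        pipeline.enable_model_cpu_offload()
    elif offload != "none":
        raise ValueError(f"unknown offload mode: {offload!r}")
    # Decodes a batch one image at a time; mainly helps with several images per call.
    pipeline.enable_vae_slicing()
    return pipeline
```

As far as I understand the diffusers memory documentation, the pipeline should not be moved to CUDA before offloading is enabled, so dropping .to("cuda") from main_cuda.py may be worth trying when experimenting with this.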

A polar bear (CUDA)

Since the seed is the same, this is to be expected, but the result is the same picture as the one generated on macOS.
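"The same picture" is a judgment by eye; float16 arithmetic on different backends is not guaranteed to match bit for bit. A rough way to quantify how close polar-bear.png and polar-bear_cuda.png are is to look at the largest per-channel pixel difference. This is a sketch with made-up helper names (max_abs_diff, max_pixel_diff), assuming both files have the same dimensions.

```python
def max_abs_diff(a, b):
    """Largest per-byte difference between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs differ in size")
    return max((abs(x - y) for x, y in zip(a, b)), default=0)


def max_pixel_diff(path_a, path_b):
    """Largest per-channel difference between two images of the same size."""
    from PIL import Image  # imported lazily; only needed for real files

    with Image.open(path_a) as img_a, Image.open(path_b) as img_b:
        return max_abs_diff(img_a.convert("RGB").tobytes(),
                            img_b.convert("RGB").tobytes())


# Toy data: the channels differ by 0, 2 and 5, so this prints 5.
print(max_abs_diff(bytes([0, 128, 255]), bytes([0, 130, 250])))
```

A result of 0 from max_pixel_diff("polar-bear.png", "polar-bear_cuda.png") would mean the two runs are pixel-identical; a small nonzero value would mean they only look the same.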

Making the same picture again is no fun, so I change the text prompt to the following and run it again.

a polar bear sitting in a chair drinking a cup of coffee

The resulting image:

A polar bear drinking a cup of coffee (CUDA)

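The result is a picture of the bear holding a coffee cup. The spilled coffee on its chest is cute.

The processing time was about 4 minutes. That is somewhat faster than the Mac Studio M1 Max (267 seconds), but not by much.

That's all.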