manually configuring a text-to-3d pipeline
Zero123plus (https://github.com/SUDO-AI-3D/zero123plus) outputs views at a fixed set of camera poses:
Azimuth (relative to the input view): 30, 90, 150, 210, 270, 330.
Elevation (absolute): 30, -20, 30, -20, 30, -20.
To generate the images from Zero123++ it's easiest to just use:
https://huggingface.co/spaces/sudo-ai/zero123plus-demo-space
and enable both background removal options.
StableSAM is usually used to remove backgrounds:
https://github.com/abhishekkrthakur/StableSAM
otherwise you can also try to magic-select them in GIMP or Krita,
or even use a lower-quality network such as https://huggingface.co/spaces/Xenova/remove-background-web
https://ezgif.com/sprite-cutter/
can quickly cut the multi-view image into individual images.
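As an alternative to the sprite cutter, a few lines of Python can split the grid. This is only a
minimal sketch assuming the usual Zero123++ output layout of six 320x320 tiles in two columns and
three rows; the filename is just a placeholder:

# split a Zero123++ multi-view image into six separate tiles
from PIL import Image

def split_views(path, tile=320, cols=2, rows=3):
    grid = Image.open(path)
    views = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)  # left, top, right, bottom
            views.append(grid.crop(box))
    return views

for i, view in enumerate(split_views("zero123pp_output.png")):
    view.save(f"frame_{i:05d}.png")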
MVDream could be considered better than Zero123++ and allows custom angles. The main difference
between Zero123++ and MVDream is that with MVDream you have to start with a text prompt and with
Zero123++ you have to start with an input image.
https://github.com/bytedance/MVDream
Zero123++ outputs 320x320 images and MVDream outputs 256x256 images.
While the transforms.json allows you to specify a depth map, and Omnidata is generally used to
produce depth and normal maps for text-to-3D, it tends to make little to no difference:
https://github.com/EPFL-VILAB/omnidata/tree/main/omnidata_tools/torch
But don't take my word for it because the depth map seems to be an important step in Stable-Dreamfusion:
https://github.com/ashawkey/stable-dreamfusion
You can also get a depth map from MiDaS or ZoeDepth, although I'm not sure exactly what types
of depth map are supported by either nerf.studio or instant-ngp.
https://github.com/isl-org/MiDaS
https://github.com/isl-org/ZoeDepth
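For reference, MiDaS can be run in a few lines via torch.hub (usage as per the MiDaS README; note
it predicts relative inverse depth rather than metric depth, and the filenames are placeholders):

# rough sketch: depth estimation with MiDaS via torch.hub
import cv2
import torch

model_type = "DPT_Large"  # or "DPT_Hybrid" / "MiDaS_small" for speed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

midas = torch.hub.load("intel-isl/MiDaS", model_type).to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform  # use .small_transform for MiDaS_small

img = cv2.cvtColor(cv2.imread("frame_00000.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(img).to(device))
    # resize the prediction back to the input resolution
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False
    ).squeeze().cpu().numpy()

# MiDaS outputs relative inverse depth; normalise to 16-bit for saving/inspection
depth = 65535 * (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
cv2.imwrite("frame_00000_depth.png", depth.astype("uint16"))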
The transforms.json file is partly documented at these URLs:
https://docs.nerf.studio/quickstart/data_conventions.html
https://github.com/NVlabs/instant-ngp/blob/master/docs/nerf_dataset_tips.md
--- highlights
{
    "camera_model": "OPENCV_FISHEYE", // camera model type [OPENCV, OPENCV_FISHEYE]
    "fl_x": 1072.0, // focal length x
    "fl_y": 1068.0, // focal length y
    "cx": 1504.0, // principal point x
    "cy": 1000.0, // principal point y
    "w": 3008, // image width
    "h": 2000, // image height
    "k1": 0.0312, // first radial distortion parameter, used by [OPENCV, OPENCV_FISHEYE]
    "k2": 0.0051, // second radial distortion parameter, used by [OPENCV, OPENCV_FISHEYE]
    "k3": 0.0006, // third radial distortion parameter, used by [OPENCV_FISHEYE]
    "k4": 0.0001, // fourth radial distortion parameter, used by [OPENCV_FISHEYE]
    "p1": -6.47e-5, // first tangential distortion parameter, used by [OPENCV]
    "p2": -1.37e-7, // second tangential distortion parameter, used by [OPENCV]
    "frames": // ... per-frame intrinsics and extrinsics parameters
}
{
    // ...
    "frames": [
        {
            "file_path": "images/frame_00001.jpeg",
            "transform_matrix": [
                // [+X0 +Y0 +Z0 X]
                // [+X1 +Y1 +Z1 Y]
                // [+X2 +Y2 +Z2 Z]
                // [0.0 0.0 0.0 1]
                [1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]
            ]
            // Additional per-frame info
        }
    ]
}
The aabb_scale parameter causes the NeRF implementation to trace rays out
to a larger or smaller bounding box containing the background elements.
According to the instant-ngp docs this value needs to be a power of two (1, 2, 4, ... up to 128).
---
What isn't documented at those URLs is the rotation parameter, which appears to be an angular
step expressed in radians (a fraction of 2*PI). It's hard to know which values can be omitted
and which can't, as it seems to vary from use case to use case, but many NeRF files do contain
this rotation parameter, as shown below.
{
    "camera_angle_x": 0.6194058656692505,
    "frames": [
        {
            "file_path": "./train/r_0",
            "rotation": 0.012566370614359171,
            "transform_matrix": [
                [-0.9754950404167175, -0.1484755426645279, -0.16237139701843262, -0.6545401215553284],
                [-0.22002151608467102, 0.6582863330841064, 0.7198954820632935, 2.901991844177246],
                [0.0, 0.7379797697067261, -0.6748228073120117, -2.7202980518341064],
                [0.0, 0.0, 0.0, 1.0]
            ]
        }
    ]
}
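For what it's worth, the rotation value in the example above works out to 2*PI/500 (0.72 degrees),
which fits the idea that it's an angular step in radians. The camera_angle_x field can also be
converted to fl_x-style intrinsics with the usual pinhole relation; w = 800 below is an assumption
based on the usual 800x800 synthetic renders:

# quick sanity checks on the example frame above
import math

rotation = 0.012566370614359171
print(math.isclose(rotation, 2 * math.pi / 500))   # True, i.e. 0.72 degrees per step

# horizontal FOV -> focal length in pixels (standard pinhole relation)
camera_angle_x = 0.6194058656692505
w = 800  # assumed image width
fl_x = 0.5 * w / math.tan(0.5 * camera_angle_x)
print(fl_x)   # ~1250 pixels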
---
The Blender NeRF plugin might help in generating the camera matrices for the transforms.json file
https://github.com/maximeraafat/BlenderNeRF
---
COLMAP (https://github.com/colmap/colmap) is a really great piece of
software for Structure-from-Motion (SfM) and Multi-View Stereo (MVS),
however it is not suitable for generating a transforms.json from Zero123plus
outputs because there is no camera motion, only rotations.
sudo apt install colmap
It's also worth mentioning that nerf.studio won't load a transforms.json
unless you specify the image file type in the "file_path" parameter (e.g. "./train/r_0.png"
rather than "./train/r_0"), whereas instant-ngp (https://github.com/NVlabs/instant-ngp) will;
none of the original NeRF samples specify the file type. (https://github.com/bmild/nerf)
The original NeRF paper (https://www.matthewtancik.com/nerf) dataset can be downloaded here:
https://drive.google.com/drive/folders/128yBriW1IG_3NJ5Rp7APSTZsJqdJdfc1?usp=sharing
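If you want to load that original dataset (or any transforms.json with extensionless file_path
values) into nerf.studio, a few lines of Python can patch the paths; a rough sketch assuming
the frames are PNGs and the file is named transforms.json:

# append the image extension to each file_path so nerf.studio will load it
import json
import os

with open("transforms.json") as f:
    transforms = json.load(f)

for frame in transforms.get("frames", []):
    path = frame["file_path"]
    if not os.path.splitext(path)[1]:  # no extension present
        frame["file_path"] = path + ".png"  # the original NeRF renders are PNGs

with open("transforms.json", "w") as f:
    json.dump(transforms, f, indent=2)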
---
Ratinod has an example transforms.json that he created specifically for the Zero123++ dataset:
https://github.com/SUDO-AI-3D/zero123plus/issues/11#issuecomment-1781951276
I loaded it into nerf.studio and the camera transforms look ok. It's hard to be completely
sure, as in its current state nerf.studio gives very little statistical information about
cameras; it only shows a visual representation of their orientation.
It would really help if nerf.studio or instant-ngp allowed you to modify the camera transforms
inside the GUI until they visually looked correct. It seems like taichi-ngp-renderer, a CPU-only
version of instant-ngp written in Python (using Taichi), allows you to do this, but I haven't had
much luck getting it to work yet.
https://github.com/Linyou/taichi-ngp-renderer
https://github.com/kwea123/ngp_pl
https://github.com/Kai-46/nerfplusplus
This saves some time setting up the transforms:
https://www.andre-gaschler.com/rotationconverter/
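Alternatively, the camera-to-world matrices for the six Zero123++ poses can be generated with a
short script. The sketch below is only a starting point: it assumes the cameras sit on a sphere
looking at the origin, a +Z-up world with the OpenGL/Blender camera convention (+X right, +Y up,
looking down -Z), and the radius, FOV and filenames are guesses that will likely need tuning:

# build a transforms.json for the six Zero123++ views (sketch only)
import json
import math
import numpy as np

def pose_spherical(azimuth_deg, elevation_deg, radius):
    """Camera-to-world matrix for a camera on a sphere looking at the origin."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    # camera position on the sphere (world +Z up; adjust if your world is +Y up)
    eye = radius * np.array([math.cos(el) * math.cos(az),
                             math.cos(el) * math.sin(az),
                             math.sin(el)])
    forward = -eye / np.linalg.norm(eye)            # camera looks at the origin
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, up, -forward, eye
    return c2w

# Zero123++ fixed poses: azimuth relative to the input view, elevation absolute
poses = [(30, 30), (90, -20), (150, 30), (210, -20), (270, 30), (330, -20)]

frames = [{"file_path": f"images/frame_{i:05d}.png",
           "transform_matrix": pose_spherical(az, el, radius=2.5).tolist()}
          for i, (az, el) in enumerate(poses)]

out = {"camera_angle_x": math.radians(50),  # guess at the FOV, tune as needed
       "aabb_scale": 2,                     # power of two, see the note above
       "frames": frames}

with open("transforms.json", "w") as f:
    json.dump(out, f, indent=2)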
I am not sure if the latent-nerf project will perform better on Zero123plus outputs than instant-ngp does:
https://github.com/eladrich/latent-nerf
---
https://huggingface.co/spaces/LiheYoung/Depth-Anything
https://huggingface.co/spaces/bookbot/Image-Upscaling-Playground
https://huggingface.co/spaces/hongfz16/3DTopia
https://huggingface.co/spaces/stabilityai/TripoSR
https://huggingface.co/spaces/flamehaze1115/Wonder3D-demo
https://huggingface.co/spaces/liuyuan-pal/SyncDreamer
https://huggingface.co/spaces/sudo-ai/zero123plus-demo-space
https://github.com/naver/dust3r - Like instant-ngp but calculates the transforms.json, depth, etc for you.
---
To be continued...
In the meantime I have an introductory article: https://ai.plainenglish.io/text-to-3d-b607bf245031
and an Itch.io mega-thread on the topic: https://itch.io/t/3519795/share-your-favourite-sources-of-free-3d-content-for-games
mirrored here: https://gist.github.com/mrbid/6a01c854b9279310f95d5601a8215574