manually configuring a text-to-3d pipeline
Zero123plus (https://github.com/SUDO-AI-3D/zero123plus) outputs views at a fixed set of camera poses:
Azimuth (relative to the input view): 30, 90, 150, 210, 270, 330.
Elevation (absolute): 30, -20, 30, -20, 30, -20.
To generate the images from Zero123++ it's easiest to just use:
https://huggingface.co/spaces/sudo-ai/zero123plus-demo-space
and enable both background removal options.
StableSAM is usually used to remove backgrounds:
https://github.com/abhishekkrthakur/StableSAM
otherwise you can also try to magic-select them in GIMP or Krita,
or even use a lower-quality network such as https://huggingface.co/spaces/Xenova/remove-background-web
https://ezgif.com/sprite-cutter/
can quickly cut the multi-view image into individual images.
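As an alternative to the sprite cutter, a few lines of Python can split the grid. This is only a
minimal sketch assuming the usual Zero123++ output layout of six 320x320 tiles in two columns and
three rows; the filename is just a placeholder:

# split a Zero123++ multi-view image into six separate tiles
from PIL import Image

def split_views(path, tile=320, cols=2, rows=3):
    grid = Image.open(path)
    views = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)  # left, top, right, bottom
            views.append(grid.crop(box))
    return views

for i, view in enumerate(split_views("zero123pp_output.png")):
    view.save(f"frame_{i:05d}.png")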
MVDream could be considered better than Zero123++ and allows custom angles. The main difference
between Zero123++ and MVDream is that with MVDream you have to start with a text prompt and with
Zero123++ you have to start with an input image.
https://github.com/bytedance/MVDream
Zero123++ outputs 320x320 images and MVDream outputs 256x256 images.
While the transforms.json allows you to specify a depth map, and Omnidata is generally used to
produce depth and normal maps for text-to-3D, it tends to make little to no difference:
https://github.com/EPFL-VILAB/omnidata/tree/main/omnidata_tools/torch
But don't take my word for it because the depth map seems to be an important step in Stable-Dreamfusion:
https://github.com/ashawkey/stable-dreamfusion
You can also get a depth map from MiDaS or ZoeDepth, although I'm not sure exactly what types
of depth map are supported by either nerf.studio or instant-ngp.
https://github.com/isl-org/MiDaS
https://github.com/isl-org/ZoeDepth
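For reference, MiDaS can be run in a few lines via torch.hub (usage as per the MiDaS README; note
it predicts relative inverse depth rather than metric depth, and the filenames are placeholders):

# rough sketch: depth estimation with MiDaS via torch.hub
import cv2
import torch

model_type = "DPT_Large"  # or "DPT_Hybrid" / "MiDaS_small" for speed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

midas = torch.hub.load("intel-isl/MiDaS", model_type).to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform  # use .small_transform for MiDaS_small

img = cv2.cvtColor(cv2.imread("frame_00000.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(img).to(device))
    # resize the prediction back to the input resolution
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False
    ).squeeze().cpu().numpy()

# MiDaS outputs relative inverse depth; normalise to 16-bit for saving/inspection
depth = 65535 * (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
cv2.imwrite("frame_00000_depth.png", depth.astype("uint16"))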
The transforms.json file is partly documented at these URLs:
https://docs.nerf.studio/quickstart/data_conventions.html
https://github.com/NVlabs/instant-ngp/blob/master/docs/nerf_dataset_tips.md
--- highlights
{
    "camera_model": "OPENCV_FISHEYE", // camera model type [OPENCV, OPENCV_FISHEYE]
    "fl_x": 1072.0, // focal length x
    "fl_y": 1068.0, // focal length y
    "cx": 1504.0, // principal point x
    "cy": 1000.0, // principal point y
    "w": 3008, // image width
    "h": 2000, // image height
    "k1": 0.0312, // first radial distortion parameter, used by [OPENCV, OPENCV_FISHEYE]
    "k2": 0.0051, // second radial distortion parameter, used by [OPENCV, OPENCV_FISHEYE]
    "k3": 0.0006, // third radial distortion parameter, used by [OPENCV_FISHEYE]
    "k4": 0.0001, // fourth radial distortion parameter, used by [OPENCV_FISHEYE]
    "p1": -6.47e-5, // first tangential distortion parameter, used by [OPENCV]
    "p2": -1.37e-7, // second tangential distortion parameter, used by [OPENCV]
    "frames": // ... per-frame intrinsics and extrinsics parameters
}
{
    // ...
    "frames": [
        {
            "file_path": "images/frame_00001.jpeg",
            "transform_matrix": [
                // [+X0 +Y0 +Z0 X]
                // [+X1 +Y1 +Z1 Y]
                // [+X2 +Y2 +Z2 Z]
                // [0.0 0.0 0.0 1]
                [1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]
            ]
            // Additional per-frame info
        }
    ]
}
The aabb_scale parameter causes the NeRF implementation to trace rays out
to a larger or smaller bounding box containing the background elements.
According to the instant-ngp docs this value needs to be a power of two (1, 2, 4, ... up to 128).
---
What isn't documented at those URLs is the rotation parameter, which appears to be an angular
step expressed in radians (a fraction of 2*PI). It's hard to know which values can be omitted
and which can't, as it seems to vary from use case to use case, but many NeRF files do contain
this rotation parameter, as shown below.
{
    "camera_angle_x": 0.6194058656692505,
    "frames": [
        {
            "file_path": "./train/r_0",
            "rotation": 0.012566370614359171,
            "transform_matrix": [
                [-0.9754950404167175, -0.1484755426645279, -0.16237139701843262, -0.6545401215553284],
                [-0.22002151608467102, 0.6582863330841064, 0.7198954820632935, 2.901991844177246],
                [0.0, 0.7379797697067261, -0.6748228073120117, -2.7202980518341064],
                [0.0, 0.0, 0.0, 1.0]
            ]
        }
    ]
}
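For what it's worth, the rotation value in the example above works out to 2*PI/500 (0.72 degrees),
which fits the idea that it's an angular step in radians. The camera_angle_x field can also be
converted to fl_x-style intrinsics with the usual pinhole relation; w = 800 below is an assumption
based on the usual 800x800 synthetic renders:

# quick sanity checks on the example frame above
import math

rotation = 0.012566370614359171
print(math.isclose(rotation, 2 * math.pi / 500))   # True, i.e. 0.72 degrees per step

# horizontal FOV -> focal length in pixels (standard pinhole relation)
camera_angle_x = 0.6194058656692505
w = 800  # assumed image width
fl_x = 0.5 * w / math.tan(0.5 * camera_angle_x)
print(fl_x)   # ~1250 pixels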
---
The Blender NeRF plugin might help in generating the camera matrices for the transforms.json file
https://github.com/maximeraafat/BlenderNeRF
---
COLMAP (https://github.com/colmap/colmap) is a really great piece of
software for Structure-from-Motion (SfM) and Multi-View Stereo (MVS),
however it is not suitable for generating a transforms.json from Zero123plus
outputs because there is no camera motion, only rotations.
sudo apt install colmap
It's also worth mentioning that nerf.studio won't load a transforms.json
unless you specify the image file type in the "file_path" parameter (e.g. "./train/r_0.png"
rather than "./train/r_0"), whereas instant-ngp (https://github.com/NVlabs/instant-ngp) will;
none of the original NeRF samples specify the file type. (https://github.com/bmild/nerf)
The original NeRF paper (https://www.matthewtancik.com/nerf) dataset can be downloaded here:
https://drive.google.com/drive/folders/128yBriW1IG_3NJ5Rp7APSTZsJqdJdfc1?usp=sharing
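If you want to load that original dataset (or any transforms.json with extensionless file_path
values) into nerf.studio, a few lines of Python can patch the paths; a rough sketch assuming
the frames are PNGs and the file is named transforms.json:

# append the image extension to each file_path so nerf.studio will load it
import json
import os

with open("transforms.json") as f:
    transforms = json.load(f)

for frame in transforms.get("frames", []):
    path = frame["file_path"]
    if not os.path.splitext(path)[1]:  # no extension present
        frame["file_path"] = path + ".png"  # the original NeRF renders are PNGs

with open("transforms.json", "w") as f:
    json.dump(transforms, f, indent=2)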
---
Ratinod has an example transforms.json that he created specifically for the Zero123++ dataset:
https://github.com/SUDO-AI-3D/zero123plus/issues/11#issuecomment-1781951276
I loaded it into nerf.studio and the camera transforms look ok. It's hard to be completely
sure, as in its current state nerf.studio gives very little statistical information about
cameras; it only shows a visual representation of their orientation.
It would really help if nerf.studio or instant-ngp allowed you to modify the camera transforms
inside the GUI until they visually looked correct. It seems like taichi-ngp-renderer, a CPU-only
version of instant-ngp written in Python (using Taichi), allows you to do this, but I haven't had
much luck getting it to work yet.
https://github.com/Linyou/taichi-ngp-renderer
https://github.com/kwea123/ngp_pl
https://github.com/Kai-46/nerfplusplus
This saves some time setting up the transforms:
https://www.andre-gaschler.com/rotationconverter/
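Alternatively, the camera-to-world matrices for the six Zero123++ poses can be generated with a
short script. The sketch below is only a starting point: it assumes the cameras sit on a sphere
looking at the origin, a +Z-up world with the OpenGL/Blender camera convention (+X right, +Y up,
looking down -Z), and the radius, FOV and filenames are guesses that will likely need tuning:

# build a transforms.json for the six Zero123++ views (sketch only)
import json
import math
import numpy as np

def pose_spherical(azimuth_deg, elevation_deg, radius):
    """Camera-to-world matrix for a camera on a sphere looking at the origin."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    # camera position on the sphere (world +Z up; adjust if your world is +Y up)
    eye = radius * np.array([math.cos(el) * math.cos(az),
                             math.cos(el) * math.sin(az),
                             math.sin(el)])
    forward = -eye / np.linalg.norm(eye)            # camera looks at the origin
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, up, -forward, eye
    return c2w

# Zero123++ fixed poses: azimuth relative to the input view, elevation absolute
poses = [(30, 30), (90, -20), (150, 30), (210, -20), (270, 30), (330, -20)]

frames = [{"file_path": f"images/frame_{i:05d}.png",
           "transform_matrix": pose_spherical(az, el, radius=2.5).tolist()}
          for i, (az, el) in enumerate(poses)]

out = {"camera_angle_x": math.radians(50),  # guess at the FOV, tune as needed
       "aabb_scale": 2,                     # power of two, see the note above
       "frames": frames}

with open("transforms.json", "w") as f:
    json.dump(out, f, indent=2)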
I am not sure if the latent-nerf project will perform better on Zero123plus outputs than instant-ngp does:
https://github.com/eladrich/latent-nerf
---
https://huggingface.co/spaces/LiheYoung/Depth-Anything
https://huggingface.co/spaces/bookbot/Image-Upscaling-Playground
https://huggingface.co/spaces/hongfz16/3DTopia
https://huggingface.co/spaces/stabilityai/TripoSR
https://huggingface.co/spaces/flamehaze1115/Wonder3D-demo
https://huggingface.co/spaces/liuyuan-pal/SyncDreamer
https://huggingface.co/spaces/sudo-ai/zero123plus-demo-space
https://github.com/naver/dust3r - Like instant-ngp but calculates the transforms.json, depth, etc for you.
---
To be continued...
In the meantime I have an introductory article: https://ai.plainenglish.io/text-to-3d-b607bf245031
and an Itch.io mega-thread on the topic: https://itch.io/t/3519795/share-your-favourite-sources-of-free-3d-content-for-games
mirrored here: https://gist.github.com/mrbid/6a01c854b9279310f95d5601a8215574