@swyxio
Created November 16, 2023 22:14
## Guild: [Nous Research AI](https://discord.com/channels/1053877538025386074)
### Nous Research AI Guild Summary
- Discussion on massive image data hosting: Members of the channel explore different options like Amazon S3, local storage, and Hugging Face for hosting TBs of image data from Midjourney. The group suggests using **Hugging Face** due to its free storage and high file size limit but acknowledges the risk of a single point of failure. A relevant [YouTube video](https://youtu.be/gqw46IcPxfY) and a [discussion post from Hugging Face](https://discuss.huggingface.co/t/is-there-a-size-limit-for-dataset-hosting/14861/3) were shared for more insights.
- An engaging dialogue on AI and music transformation referenced Google's project, followed by disappointment that such cutting-edge AI technology is not open source. AI's potential in game playing, especially at the pixel level, was also touched on via an [old Python project](https://www.youtube.com/watch?v=eQC1JGMIxU0).
- Notable references include the "Skeleton of Thought" paper and a similar "Tree of Thought" concept for dataset generation; a [link](https://vxtwitter.com/migtissera/status/1725028677124235288) was shared for further reading.
- New [text-to-video research](https://emu-video.metademolab.com/assets/emu_video_v1.pdf) from Meta was discussed by members, with comparisons to 'animate-diff', and the Synthia-v1.3 dataset available on [Hugging Face](https://huggingface.co/datasets/migtissera/Synthia-v1.3) was shared.
- An exciting announcement: **OpenHermes 2.5** achieved a high ranking on the HF Leaderboard, securing second place in the 7B models category.
- Interesting probability questions were posed as challenges, and positive feedback was given for **Claude v2**.
- Members of the general channel covered a variety of topics: AI music generation, fine-tuning and training difficulties, how Nous Research is organized, model-training resources (pointing toward Axolotl), and a call for contributions to "gptslop" on GitHub.
- A new Capybara 34B API and playground were excitedly [introduced](https://openrouter.ai/models/nousresearch/nous-capybara-34b).
- The workings and uses of LLMs were discussed among users, touching on topics such as Rust code analysis, full finetuning versus continued pre-training, training on non-Roman languages, and application to fiction text data. The role of chaos in AI was humorously mentioned.
- A shared [link](https://huggingface.co/learn/nlp-course/chapter6/2) to Hugging Face's NLP course provides guidance on how to train a tokenizer.
- Reactions were shared to a Hacker News post titled "[Artificial Intelligence can't deal with Chaos](https://news.ycombinator.com/item?id=36225723)".
**Nous Research AI Channel Summaries**
### Nous Research AI Channel: [off-topic](https://discord.com/channels/1053877538025386074/1109649177689980928) (7 messages🔥):
**Massive Image Data Hosting Discussion**:
- Members @yorth_night, @.wooser, @benxh, @crainmaker, and @tsunemoto discuss options to store and handle terabytes of image data from Midjourney. Initial options mentioned are Amazon S3 and local storage, but they conclude that costs and capacity are roadblocks.
- @yorth_night indicates the data amount could potentially be dozens of terabytes, consisting of ten million images with their prompts.
- @benxh suggests using Hugging Face as a platform, as it allows datasets to be streamed, eliminating the need for disk space.
- @crainmaker confirms that Hugging Face offers a per-file limit of 50GB with no overall limit, according to a [discussion post](https://discuss.huggingface.co/t/is-there-a-size-limit-for-dataset-hosting/14861/3) on the Hugging Face forum.
- @tsunemoto and @crainmaker discuss creating smaller Parquet batches and storing an MD5 hash instead of the full image data, but acknowledge that storage for a large image dataset is still required.
- @benxh recommends using the `huggingface_hub` Python library to push each Parquet file as soon as it's saved (a minimal sketch of this workflow appears after this list).
- The participants ultimately decide to go with Hugging Face due to its free storage and high file size limit. However, @crainmaker notes that relying on one platform can pose a risk of a single point of failure.
- @pradeep1148 shares a [YouTube video link](https://youtu.be/gqw46IcPxfY), but its relevance or context is not discussed.
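A minimal sketch of the hosting workflow discussed above, assuming a hypothetical dataset repo name and record layout; `HfApi.upload_file` and `load_dataset(..., streaming=True)` are standard `huggingface_hub`/`datasets` calls, but everything else is illustrative:

```python
import hashlib

import pandas as pd
from huggingface_hub import HfApi

api = HfApi()  # assumes `huggingface-cli login` has been run
REPO_ID = "your-org/midjourney-images"  # hypothetical dataset repo

def push_shard(records: list[dict], shard_idx: int) -> None:
    """Write one Parquet shard (image bytes + prompt + MD5) and push it immediately."""
    df = pd.DataFrame({
        "md5": [hashlib.md5(r["image_bytes"]).hexdigest() for r in records],
        "image": [r["image_bytes"] for r in records],
        "prompt": [r["prompt"] for r in records],
    })
    path = f"shard-{shard_idx:05d}.parquet"
    df.to_parquet(path)  # keep each shard well under the 50GB per-file limit
    api.upload_file(  # push right away so the local copy can be deleted
        path_or_fileobj=path,
        path_in_repo=f"data/{path}",
        repo_id=REPO_ID,
        repo_type="dataset",
    )

# Consumers can then stream the dataset without downloading it to disk:
# from datasets import load_dataset
# ds = load_dataset(REPO_ID, streaming=True, split="train")
```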
### Nous Research AI Channel: [interesting-links](https://discord.com/channels/1053877538025386074/1132352574750728192) (7 messages🔥):
**AI and Music Transformation**:
- **Transforming Singing into Orchestral**: @yorth_night and @.wooser discussed the potential of AI in music creation and transformation, particularly inspired by Google's project. A link to the [related blog post](https://deepmind.google/discover/blog/transforming-the-future-of-music-creation/) was shared. @ldj shared a [YouTube video](https://youtu.be/rrk1t_h2iSQ?si=njkk-ajNonaiTum4) showcasing the technology.
- **Dreams of Open Source**: Both @yorth_night and @.wooser expressed disappointment that this cutting-edge technology isn't open source currently and hope to see it in the future.
**Skeleton of Thought and Tree of Thought**:
- **Skeleton and Tree of Thought**: @georgejrjrjr referenced the "Skeleton of Thought" paper. @yorth_night shared a link about a similar concept called the [Tree of thought for dataset generation](https://vxtwitter.com/migtissera/status/1725028677124235288).
**AI Game Playing**:
- **Pixel-Level AI Game Playing**: @f3l1p3_lv presented an [old Python project](https://www.youtube.com/watch?v=eQC1JGMIxU0) where a neural network is used to play a game by seeing every pixel of the game window.
**Text-to-Video model from Meta**:
- **Meta's New Text-to-Video Model**: @tsunemoto posted links to [new research](https://emu-video.metademolab.com/assets/emu_video_v1.pdf) from Meta on text-to-video models. @teknium and @qasb discussed the work further, comparing it to the 'animate-diff' approach.
**Datasets Link**:
- **Synthia-v1.3 Dataset**: @teknium shared a link to the [Synthia-v1.3 dataset on Hugging Face](https://huggingface.co/datasets/migtissera/Synthia-v1.3) (a minimal loading snippet follows).
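For reference, the dataset can be pulled with the standard `datasets` API; the split name below is an assumption:

```python
from datasets import load_dataset

# Stream the Synthia-v1.3 dataset without downloading it all to disk.
synthia = load_dataset("migtissera/Synthia-v1.3", split="train", streaming=True)
print(next(iter(synthia)))  # inspect one record
```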
### Nous Research AI Channel: [announcements](https://discord.com/channels/1053877538025386074/1145143867818119272) (7 messages🔥):
**OpenHermes 2.5 Ranking on HF Leaderboard**:
- @teknium excitedly announced that **OpenHermes 2.5** has achieved a high ranking on the HF Leaderboard, securing a second-place position in the 7B models category.
### Nous Research AI Channel: [bots](https://discord.com/channels/1053877538025386074/1149866614590816256) (7 messages🔥):
**Probabilistic Challenges and Feedback on Claude v2**:
- **Probability Questions**: @f3l1p3_lv posed several probability challenges:
- In a classroom with a certain number of students split by gender and eyeglasses usage, what's the probability that a random student will be a woman who doesn't wear glasses? The options given were A) 40%, B) 12%, C) 60% and D) 16%.
- Two non-biased dice are thrown, and both numbers are odd. What's the probability their sum will be 8? The options given were A) 2/36, B) 1/6, C) 2/9, D) 1/4, and E) 2/18.
- In a group of multilingual men and women, what's the probability that a randomly chosen French speaker will be a man? The options given were A) 47/99, B) 35/68, C) 92/193, and D) 52/99.
- If a die is thrown twice, what's the probability that the first throw will result in a 3, given that the sum of the two throws equals 7? The options given were A) 2/6, B) 1/6, C) 1/2, and D) 3/7. (A brute-force check of the two dice questions appears after this list.)
- **Claude v2 Feedback**: @f3l1p3_lv commended the Claude v2 model, expressing that it was "the best".
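The two dice questions above can be checked by brute-force enumeration; the snippet below is not from the original discussion, just a quick verification of the arithmetic:

```python
from itertools import product

# All ordered outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Both numbers are odd: probability the sum is 8.
both_odd = [(a, b) for a, b in outcomes if a % 2 and b % 2]
print(sum(a + b == 8 for a, b in both_odd) / len(both_odd))  # 2/9 ≈ 0.222 (option C)

# Given the two throws sum to 7: probability the first throw is a 3.
sum_is_7 = [(a, b) for a, b in outcomes if a + b == 7]
print(sum(a == 3 for a, b in sum_is_7) / len(sum_is_7))  # 1/6 ≈ 0.167 (option B)
```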
### Nous Research AI Channel: [general](https://discord.com/channels/1053877538025386074/1149866623109439599) (7 messages🔥):
**AI Music Generation Discussion**:
- @yorth_night shared Suno.ai's music generation capabilities and how even a seasoned musician friend couldn't tell the music was AI-generated. He also praised Suno's instrumentals, considering them near-perfect, with occasional minor issues in the vocals.
- @wooser asked how Suno.ai creates its music, with @yorth_night suggesting **Bark** could be the base model for voice generation.
**Fine-tuning and Training Models**:
- @cue asked for help fully finetuning **Llama 2 70B** on 8x A100 GPUs. He mentioned trying Axolotl, Hugging Face scripts, and Accelerate with DeepSpeed, but ran into **Out of Memory (OOM)** errors. He also pointed to a **Hugging Face** blog post on the topic ([https://huggingface.co/blog/ram-efficient-pytorch-fsdp](https://huggingface.co/blog/ram-efficient-pytorch-fsdp)) but was unable to replicate its methods.
- @wooser suggested checking with the **Axolotl** Discord and sharing the specific OOM errors, and raised possible issues caused by Docker setups.
- @teknium suggested that 16x 80GB GPUs across multiple nodes are required for a full fine-tune of a 70B model (a rough memory estimate supporting this appears after this list), while @tokenbender joked that finetuning with **deepspeed** would take forever.
- @euclaise mentioned that **LoRA** is equivalent to full finetuning if a sufficiently large rank is selected and the embedding layers are also finetuned.
- Despite this, @teknium mentioned observations by another user that a high rank was doing worse than both full finetuning and a low rank, pointing out that full-rank **LoRA** might not be equivalent to full finetuning after all.
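A rough back-of-the-envelope estimate (not from the discussion) of why a full finetune of a 70B model overflows 8x A100 80GB: with bf16 weights and gradients plus Adam's fp32 master weights and two fp32 moments, the training state alone exceeds the available memory before activations are even counted.

```python
# Illustrative memory estimate for full finetuning a 70B-parameter model with Adam
# in mixed precision; activations and framework overhead are ignored.
params = 70e9

weights_bf16 = params * 2           # 140 GB
grads_bf16 = params * 2             # 140 GB
master_weights_fp32 = params * 4    # 280 GB
adam_moments_fp32 = params * 4 * 2  # 560 GB (first + second moment)

total_gb = (weights_bf16 + grads_bf16 + master_weights_fp32 + adam_moments_fp32) / 1e9
print(f"~{total_gb:.0f} GB of training state")  # ~1120 GB
print(f"8 x 80GB GPUs  = {8 * 80} GB")          # 640 GB  -> OOM regardless of sharding
print(f"16 x 80GB GPUs = {16 * 80} GB")         # 1280 GB -> plausible with FSDP/ZeRO-3 sharding
```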
**Nous Capybara 34B API and Playground**:
- @alexatallah introduced a new **Capybara 34B** API and playground ([https://openrouter.ai/models/nousresearch/nous-capybara-34b](https://openrouter.ai/models/nousresearch/nous-capybara-34b)); a minimal request sketch follows below.
- @gabriel_syme asked about the number of models expected to be found in the new API and playground.
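A minimal request sketch against the new endpoint, assuming OpenRouter's usual OpenAI-compatible chat completions API; the exact headers and response shape here follow that convention rather than details from the announcement:

```python
import os

import requests

# Hypothetical minimal call to Nous Capybara 34B via OpenRouter.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "nousresearch/nous-capybara-34b",
        "messages": [{"role": "user", "content": "Tell me one fun fact about capybaras."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```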
**Organization of Nous Research**:
- @f3l1p3_lv asked about the leadership structure of Nous Research, which @teknium clarified by stating that there are four founders of the organization, including himself.
**AI Models Training Resources**:
- @00brad requested guidance on training models to optimize for structured output. @teknium advised him to look into **Axolotl**.
**GPT Slops**:
- @alpindale urged others to contribute to "gptslop" on his GitHub ([https://github.com/AlpinDale/gptslop](https://github.com/AlpinDale/gptslop)).
- @giftedgummybee suggested that @alpindale refer to Eric's GPT-4 filtering list to find the slop phrases, and request more data from @teknium if necessary. However, @teknium clarified that his dataset mostly covers alignment slop, not prose slop.
### Nous Research AI Channel: [ask-about-llms](https://discord.com/channels/1053877538025386074/1154120232051408927) (7 messages🔥):
**AI Development and Use-Cases on LLMs and Non-Roman Language Training**:
- **LLMs and Rust Code Analysis**: @ac1dbyte shared their experience using Hermes AI to conduct meticulous, vulnerability-focused reviews of Rust code, and discussed their efforts to semi-normalize the generated reports into JSON for easier integrations, such as with MongoDB. They also shared their base prompt for 'Bytecode AI'.
- **Continued Pre-training vs. Full Finetuned**: @.wooser and @teknium discussed the differences between continued pre-training and a full finetune on LLMs, essentially concluding that the difference lies in the size of the dataset. @.wooser humorously described the process as "changing the numbers by showing it new stuff".
- **Training LLMs on Non-Roman Languages**: The users also discussed training LLMs on non-Roman languages, with emphasis on Japanese. @.wooser mentioned that a minimum of roughly 40B tokens and a full finetune are generally required for such training, and a new tokenizer might also be needed. They also shared a link from Hugging Face's NLP course that provides a guide on [how to train a tokenizer](https://huggingface.co/learn/nlp-course/chapter6/2) (a minimal sketch appears after the Relevant Links below).
- **Application of LLMs to Fiction Text Data**: @.wooser discussed their aim of using a large corpus of Japanese fiction text to train an LLM to resemble NovelAI or older versions of AI Dungeon. They presented their current approach (repeating the instruction 'Finish the following section of text' in the prompt) and asked whether this could negatively affect the model's behavior, and whether better approaches exist.
- **Perceptions and Fear of AI**: The discussions touched on the widespread fear and misunderstandings related to AI, as mentioned by @teknium and @.wooser. They both mentioned experiences of people being scared about risks to jobs and issues like deepfakes.
- **Other LLM models**: @f3l1p3_lv asked whether LLMs could use neural network architectures other than the Transformer.
Relevant Links:
- Hugging Face's NLP course on [how to train a tokenizer](https://huggingface.co/learn/nlp-course/chapter6/2).
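Following the NLP-course link above, a new tokenizer can be trained from an existing one with `train_new_from_iterator`; a minimal sketch, where the corpus name, base tokenizer, and vocabulary size are all placeholder assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical Japanese text corpus; any iterable of strings works.
corpus = load_dataset("your-org/japanese-fiction-corpus", split="train", streaming=True)

def batch_iterator(batch_size: int = 1000):
    batch = []
    for example in corpus:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Retrain an existing fast tokenizer's algorithm on the new corpus.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=52000)
new_tokenizer.save_pretrained("japanese-tokenizer")
```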
### Nous Research AI Channel: [memes](https://discord.com/channels/1053877538025386074/1166105758635655270) (7 messages🔥):
**AI and Memes Discussion**:
- @.wooser reacted to a link shared by @_automagic from Hacker News titled "[Artificial Intelligence can't deal with Chaos](https://news.ycombinator.com/item?id=36225723)", quipping that chaos is a perfect field for AI.