Martin Laprise mlaprise

Training open-source LLMs on ChatGPT output is a really bad idea.

Everyone is now racing to create open-source alternatives to compete with GPT3.5/GPT4. A common shortcut some teams use to bootstrap their effort is to fine-tune their model on ChatGPT output. I used to think this was a good idea and totally fair play. Actually, I still think it's fair play. OpenAI effectively distilled the entire web into its models. They say themselves that they use (mostly) publicly accessible information. Distilling their model is therefore, in effect, distilling the public open web, so small Terms of Service details aside, I don't see major ethical problems with that. Right? Well, that's not entirely true, and I now realize that, even ignoring the ethical considerations, using their output is a really bad idea.

First of all, from a purely technical point of view, as @yoavgo explains beautifully in his recent post, there is no way to align LLMs correctly without the RLHF component. I encourag

@mlaprise
mlaprise / stablediffusionwalk.py
Created August 16, 2022 15:09 — forked from karpathy/stablediffusionwalk.py
hacky stablediffusion code for generating videos
"""
draws many samples from a diffusion model by slerp'ing around
the noise space, and dumps frames to a directory. You can then
stitch up the frames with e.g.:
$ ffmpeg -r 10 -f image2 -s 512x512 -i out/frame%04d.jpg -vcodec libx264 -crf 10 -pix_fmt yuv420p test.mp4
THIS FILE IS HACKY AND NOT CONFIGURABLE READ THE CODE, MAKE EDITS TO PATHS AND SETTINGS YOU LIKE
THIS FILE IS HACKY AND NOT CONFIGURABLE READ THE CODE, MAKE EDITS TO PATHS AND SETTINGS YOU LIKE
THIS FILE IS HACKY AND NOT CONFIGURABLE READ THE CODE, MAKE EDITS TO PATHS AND SETTINGS YOU LIKE
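The docstring above mentions slerp'ing around the noise space. As a rough illustration only, here is a minimal sketch of spherical linear interpolation on plain Python lists. The function name `slerp`, the `dot_threshold` parameter, and the list-based representation are assumptions made for this example; the preview does not show the gist's own implementation, which may differ.

```python
import math

def slerp(t, v0, v1, dot_threshold=0.9995):
    """Spherically interpolate between two non-zero vectors of equal length.

    t=0 returns v0, t=1 returns v1. Hypothetical helper for illustration.
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    # Nearly parallel vectors: fall back to plain linear interpolation
    if abs(dot) > dot_threshold:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    theta = math.acos(dot)
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

Stepping `t` from 0 to 1 between two noise samples and decoding each interpolated vector would give one frame per step, which is the kind of walk the docstring describes.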
@mlaprise
mlaprise / TDA_resources.md
Created November 5, 2016 01:30 — forked from calstad/TDA_resources.md
List of resources for TDA

Quick List of Resources for Topological Data Analysis with Emphasis on Machine Learning

This is just a quick list of resources on TDA that I put together for @rickasaurus after he asked on Twitter for links to papers, books, etc. It is by no means an exhaustive list.

Survey Papers

Both Carlsson's and Ghrist's survey papers offer a very good introduction to the subject.

Other Papers and Web Resources

{u'anonymousId': u'9c382115-1b4e-4d54-9b91-515eb627b284',
u'category': None,
u'channel': u'client',
u'context.ip': u'82.11.127.141',
u'context.library.name': u'analytics.js',
u'context.library.version': u'2.11.1',
u'context.page.path': u'/inbox',
u'context.page.referrer': u'',
u'context.page.search': u'',
u'context.page.title': u'Klood',
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA8vqEl2gcm8gG8o7Jopv8l2MSpKiqp/QZJRIqeVVGzo8fsQoP+aYxtfLXAuuAfwgIwbRBD7h5Fb4AI7GUsIFSGSP/n5sMW01hhU2mEun3e+EO4qW1O95U95tVLE/fZu4N6if4Ep/sY9ilBIjkoWDxdZNajwT6BTGdvqLHRTx41KzvE08J7xNqC/B27GECPnUvuRhQD/CofiRwun1sJKXQ2TtZ937HoF5TyP7NbpsVLVZYSFI7HXK4ij6bbJgxlVK1EWyxh5Dcjnh7CRv/0yD1ldJsWHnapSeDDM0mHFBboHkSwdqlQNUcPRUE76ou3sVih8DD+RRsh6z13CaE2CQgOQ== mlaprise@mlaprise-laptop
@mlaprise
mlaprise / sorting by relevance
Created April 14, 2016 15:17
Sorting mentions by relevance
### Sort news by authority
GET documents-2016-04-12/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "viagra",
"fields": ["content.title", "content.text"]
@mlaprise
mlaprise / gist:54feda33e5f58e79188f
Created December 3, 2015 16:59
Logsplitter error
Backend error message
---------------------
java.lang.IllegalArgumentException: Invalid format: "31bf8798-5495-42f4-8eab-76c906ef5470" is malformed at "bf8798-5495-42f4-8eab-76c906ef5470"
at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:873)
at org.apache.pig.piggybank.evaluation.datetime.truncate.ISOHelper.parseDateTime(ISOHelper.java:65)
at org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour.exec(ISOToHour.java:91)
at org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour.exec(ISOToHour.java:83)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
Traceback (most recent call last):
File "/home/cogtree/extract_post_cycle.py", line 207, in <module>
get_post_cycles(apikey)
File "/home/cogtree/extract_post_cycle.py", line 183, in get_post_cycles
.saveAsTextFile("s3n://parsely-pyspark/output/post_cycle_logs/{}.gz".format(apikey)))
File "/opt/spark/python/pyspark/rdd.py", line 1288, in saveAsTextFile
keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9885.saveAsTextFile.
@mlaprise
mlaprise / lorenz3d.py
Created March 3, 2010 13:38
Generate the Lorenz Attractor and its representation in a 3D phase space
#!/usr/bin/env python
import vtk
from scipy import *
from scipy import integrate
from pylab import *
from vtk.util.colors import tomato, banana
nbrPoints = 5000
@mlaprise
mlaprise / mandelbrot.py
Created March 3, 2010 02:37
Generate a representation of the Mandelbrot Set
from numpy import *
import matplotlib.pylab as pl
maxIteration = 128
z_min = -2-1j
z_max = 1+1j
# Set the image size here
imageSize = [512,512]
image = zeros(imageSize,int)
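
The preview above stops right after the image array is allocated. As a hedged sketch of how an escape-time computation over the same region could look, the example below uses plain Python lists instead of NumPy arrays so that it stays self-contained. The function names `escape_count` and `mandelbrot_grid`, the escape radius of 2, and the small default grid size are assumptions for illustration, not the gist's actual code.

```python
def escape_count(c, max_iteration=128):
    """Count iterations of z -> z*z + c before |z| exceeds 2."""
    z = 0j
    for n in range(max_iteration):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    # Never escaped: treat the point as inside the set
    return max_iteration

def mandelbrot_grid(z_min=-2-1j, z_max=1+1j, width=64, height=64,
                    max_iteration=128):
    """Return a height x width nested list of escape counts."""
    rows = []
    for iy in range(height):
        # Map the row index onto the imaginary axis
        im = z_min.imag + (z_max.imag - z_min.imag) * iy / (height - 1)
        row = []
        for ix in range(width):
            # Map the column index onto the real axis
            re = z_min.real + (z_max.real - z_min.real) * ix / (width - 1)
            row.append(escape_count(complex(re, im), max_iteration))
        rows.append(row)
    return rows
```

The resulting grid can be handed to an image plotting routine; a vectorized NumPy version would be considerably faster at 512x512.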