
Chris van den Berg (Bergvca)

@Bergvca
Bergvca / stratified_sampling.R
Created April 23, 2017 15:25
Example of how to do stratified sampling in caret. This is useful for imbalanced datasets and can be used to give more weight to a minority class.
len_pos <- nrow(example_dataset[example_dataset$target==1,])
len_neg <- nrow(example_dataset[example_dataset$target==0,])
train_model <- function(training_data, labels, model_type, ...) {
    experiment_control <- trainControl(method = "repeatedcv",
                                       number = 10,
                                       repeats = 2,
                                       classProbs = TRUE,
                                       summaryFunction = custom_summary_function)
    train(x = training_data,
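The gist itself uses R and caret, but the underlying idea — draw an equal-sized sample from each class so the minority class is not drowned out — can be sketched in plain Python. This is a minimal standard-library sketch; the helper name `stratified_sample` and the toy `target` column are hypothetical, not part of the gist:

```python
import random

def stratified_sample(rows, label_key, n_per_class, seed=0):
    """Draw up to n_per_class rows from each class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    sample = []
    for label, members in sorted(by_class.items()):
        sample.extend(rng.sample(members, min(n_per_class, len(members))))
    return sample

# Imbalanced toy data: 8 negatives, 2 positives.
data = [{"target": 0}] * 8 + [{"target": 1}] * 2
balanced = stratified_sample(data, "target", n_per_class=2)
# balanced now holds 2 rows per class, 4 rows total.
```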
@Bergvca
Bergvca / unnittestexample.py
Last active December 22, 2022 20:54
Some unittest + Mock examples in Python. Includes examples of how to mock an entire class (ZipFile), mock an iterator object, and how to use side_effect properly.
import unittest
import os
from zipfile import ZipFile
from mock import MagicMock, patch, Mock, mock_open
# The functions that are tested:
def function_to_test_zipfile(example_arg):
with ZipFile(example_arg, 'r') as zip_in:
for input_file in zip_in.infolist():
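A condensed, self-contained version of the same mocking pattern, using the standard-library `unittest.mock` rather than the older external `mock` package. The function under test, `count_zip_entries`, is a hypothetical stand-in for the gist's `function_to_test_zipfile`:

```python
import unittest
from unittest.mock import patch

# Hypothetical function under test: counts the entries in a zip archive.
def count_zip_entries(path):
    from zipfile import ZipFile
    with ZipFile(path, "r") as zip_in:
        return len(zip_in.infolist())

class ZipFileTest(unittest.TestCase):
    @patch("zipfile.ZipFile")
    def test_count_entries(self, mock_zipfile):
        # The "with" statement binds zip_in to __enter__'s return value,
        # so that is where infolist must be stubbed.
        mock_instance = mock_zipfile.return_value.__enter__.return_value
        mock_instance.infolist.return_value = ["a.txt", "b.txt"]
        self.assertEqual(count_zip_entries("archive.zip"), 2)
        mock_zipfile.assert_called_once_with("archive.zip", "r")
```

Patching `"zipfile.ZipFile"` (where the class lives) rather than a local alias is what makes the lookup inside the function hit the mock.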
@Bergvca
Bergvca / Name matching in SQL Server.sql
Last active April 4, 2018 11:51
Name matching in SQL Server example
-- First create matches using a UDF; here I am using a combination of Jaro-Winkler and (a normalized version of) Levenshtein
--
-- Input: cleaned_table: a table with "cleaned" names
-- Output: tmp_groups: a table with (uid, group_id) tuples. Each group_id contains all uids belonging to names that match.
DROP TABLE #matches
SELECT a.clean_Name,
a.uid,
b.clean_Name clean_name_2,
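The normalized-Levenshtein half of that similarity score can be sketched in Python as a reference implementation. This is not the SQL UDF the gist calls, just the standard dynamic-programming edit distance scaled into a [0, 1] similarity:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Dividing by the longer string's length is one common normalization; matching pipelines like the gist's then keep pairs whose score clears a chosen threshold.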
@Bergvca
Bergvca / Pyspark_LDA_Example.py
Created February 3, 2016 13:59
Example of how to do LDA in Spark ML and MLlib with Python
import findspark
findspark.init("[spark install location]")
import pyspark
import string
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.util import MLUtils
from pyspark.sql.types import *
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel, Tokenizer, RegexTokenizer, StopWordsRemover
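The preprocessing chain these imports set up (tokenize, remove stop words, count-vectorize) is what produces the document-term counts LDA consumes. The same stages can be mimicked in plain Python to show what each one feeds into the next; this is a toy sketch, not the Spark pipeline itself, and the stop-word list is a hypothetical stand-in:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of"}  # hypothetical stand-in list

def tokenize(text):
    """Lowercase and split on non-word characters (like RegexTokenizer)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def remove_stop_words(tokens):
    """Drop stop words (like StopWordsRemover)."""
    return [t for t in tokens if t not in STOP_WORDS]

def count_vectorize(docs_tokens):
    """Build a vocabulary and per-document term-count vectors
    (like CountVectorizer)."""
    vocab = sorted({t for tokens in docs_tokens for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for tokens in docs_tokens:
        vec = [0] * len(vocab)
        for term, n in Counter(tokens).items():
            vec[index[term]] = n
        vectors.append(vec)
    return vocab, vectors

docs = ["The cat sat", "A cat is a cat"]
tokens = [remove_stop_words(tokenize(d)) for d in docs]
vocab, vectors = count_vectorize(tokens)
```

In the Spark version each stage is a distributed transformer, but the shape of the data flowing between stages is the same: token lists in, sparse count vectors out.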