


Elie A. eliasah

@eliasah
eliasah / SQLTransformerWithJoin.scala
Created June 17, 2020 12:37 — forked from MLnick/SQLTransformerWithJoin.scala
Using SQLTransformer to join DataFrames in ML Pipeline
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
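The gist above joins DataFrames inside an ML Pipeline via SQLTransformer. A hedged PySpark sketch of the idea (the DataFrames, view name, and column names below are illustrative, not taken from the gist): SQLTransformer registers the pipeline's input as the temp view __THIS__, so its SQL statement can join against any other registered temp view.

```python
from pyspark.ml.feature import SQLTransformer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqltransformer-join").getOrCreate()

# Illustrative data: features keyed by id, labels in a separate DataFrame.
features = spark.createDataFrame([(1, 0.5), (2, 1.5)], ["id", "score"])
labels = spark.createDataFrame([(1, 1.0), (2, 0.0)], ["id", "label"])
labels.createOrReplaceTempView("labels")  # visible to the SQL statement

# __THIS__ refers to whatever DataFrame flows into the transformer.
joiner = SQLTransformer(statement="""
    SELECT __THIS__.*, labels.label
    FROM __THIS__ JOIN labels ON __THIS__.id = labels.id
""")
joiner.transform(features).show()
```

Because SQLTransformer is a pipeline stage, the join happens lazily for every DataFrame the fitted pipeline is applied to.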
@eliasah
eliasah / multiple-files-remove-prefix.md
Created September 27, 2019 14:35
Remove prefix from multiple files in Linux console

Bash

for file in prefix*; do mv "$file" "${file#prefix}"; done;

The for loop iterates over all files that start with the prefix, and mv renames each one, using the ${file#prefix} parameter expansion to strip the prefix from the filename.

Here is an example that removes "bla_" from the following files:

bla_1.txt
bla_2.txt
@eliasah
eliasah / custom_s3_endpoint_in_spark.md
Created August 30, 2019 08:54 — forked from tobilg/custom_s3_endpoint_in_spark.md
Description on how to use a custom S3 endpoint (like Rados Gateway for Ceph)

Custom S3 endpoints with Spark

To be able to use custom endpoints with the latest Spark distribution, one needs to add an external package (hadoop-aws). Then, custom endpoints can be configured according to the docs.

Use the hadoop-aws package

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2

SparkContext configuration
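A hedged PySpark sketch of what this configuration can look like (the endpoint URL and credentials are placeholders, and the fs.s3a.* keys assume the s3a connector from hadoop-aws; `sc._jsc` reaches into the JVM and is not a public API):

```python
from pyspark import SparkContext

sc = SparkContext(appName="custom-s3-endpoint")

# Underlying JVM Hadoop Configuration object.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "http://rgw.example.com:8080")  # placeholder
hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")            # placeholder
hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")            # placeholder
hconf.set("fs.s3a.path.style.access", "true")  # typically needed for Ceph RGW
```

After this, paths like s3a://bucket/key resolve against the custom endpoint instead of AWS.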

strip_glm <- function(cm) {
  # Null out the heavyweight components of a fitted glm object,
  # keeping only what predict() needs, to shrink the saved model.
  cm$y = c()
  cm$model = c()
  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm  # return the slimmed model
}
@eliasah
eliasah / bootstrap-install-zeppelin-0.8-aws-linux.sh
Created November 21, 2018 14:16 — forked from vak/bootstrap-install-zeppelin-0.8-aws-linux.sh
Custom bootstrap script to install Zeppelin 0.8 on AWS EMR (tested on EMR 5.16.0)
#!/bin/bash -ex
# ATTENTION:
#
# 1. Ensure you have about 1 GB of free space under /usr/lib/ for the large
#    Zeppelin bundle chosen by default below, or pick a smaller bundle from
#    the Zeppelin website.
#
# 2. Adjust the values of ZEPPELIN_NOTEBOOK_S3_BUCKET and
#    ZEPPELIN_NOTEBOOK_S3_USER if you want S3 persistence of your Zeppelin
#    notebooks; otherwise just remove the last three export lines starting
#    with 'export ZEPPELIN_NOTEBOOK_S'.
@eliasah
eliasah / terminal-git-branch-name.md
Created October 2, 2018 08:28 — forked from joseluisq/terminal-git-branch-name.md
Add Git Branch Name to Terminal Prompt (Mac)



Open ~/.bash_profile in your favorite editor and add the following content to the bottom.

# Git branch in prompt.

parse_git_branch() {
  # Print the current git branch in parentheses; silent outside a repo.
  git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/ (\1)/'
}
@eliasah
eliasah / minikube.md
Last active May 23, 2018 08:54 — forked from codesword/minikube.md
Installing minikube using xhyve driver

Install docker-machine-driver-xhyve

docker-machine-driver-xhyve is a Docker Machine driver plugin for xhyve, a lightweight native OS X hypervisor. In my opinion, it's a far better option than VirtualBox for running minikube.

Brew

On macOS Sierra, install the latest version using

brew install docker-machine-driver-xhyve --HEAD
import pandas as pd
import numpy as np
import scipy
import scipy.stats as sts
import random
import pyspark
import pyspark.sql.types as stypes
import pyspark.sql.functions as sfunctions
@eliasah
eliasah / IntelliJ_IDEA__Perf_Tuning.txt
Created January 9, 2017 15:02 — forked from P7h/IntelliJ_IDEA__Perf_Tuning.txt
Performance tuning parameters for IntelliJ IDEA. Add these params to the idea64.exe.vmoptions or idea.exe.vmoptions file in IntelliJ IDEA. If you are using JDK 8.x, remove the PermSize and MaxPermSize parameters from the tuning configuration, since the permanent generation no longer exists in Java 8.
-server
-Xms2048m
-Xmx2048m
-XX:NewSize=512m
-XX:MaxNewSize=512m
-XX:PermSize=512m
-XX:MaxPermSize=512m
-XX:+UseParNewGC
-XX:ParallelGCThreads=4
-XX:MaxTenuringThreshold=1

Hello, I am using a linear SVM to train my model and generate a separating line through my data. However, my model always predicts 1 for all feature examples. Here is my code:

print data_rdd.take(5)
# [LabeledPoint(1.0, [1.9643,4.5957]), LabeledPoint(1.0, [2.2753,3.8589]), LabeledPoint(1.0, [2.9781,4.5651]), LabeledPoint(1.0, [2.932,3.5519]), LabeledPoint(1.0, [3.5772,2.856])]


from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors
from sklearn.svm import SVC
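One thing worth checking before blaming the model: the take(5) output above shows only label 1.0, and an SVM fitted on single-class data can only ever predict that class. A quick sanity check with plain scikit-learn on those same five points (the two-class labels below are made up for illustration; the real RDD showed all 1.0):

```python
import numpy as np
from sklearn.svm import SVC

# The five feature vectors from data_rdd.take(5) above.
X = np.array([[1.9643, 4.5957], [2.2753, 3.8589], [2.9781, 4.5651],
              [2.932, 3.5519], [3.5772, 2.856]])

# Hypothetical labels with BOTH classes present, unlike the sampled RDD.
y = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

clf = SVC(kernel="linear").fit(X, y)
preds = clf.predict(X)
print(preds)  # predictions now vary across classes instead of being constant
```

If the full RDD really contains both labels, the next things to inspect are the class balance and whether SVMWithSGD's default iterations and step size converge on this data.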