Useful Cheat Sheets

# Drop columns
df = df.drop(['col1', 'col2', 'col3'], axis=1)

# Count NaN values for each column
df.isnull().sum()
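
The two snippets above can be tried on a small frame. This is a minimal sketch; the DataFrame and its column names (`col1`, `col2`, `col3`, `keep`) are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are invented for this example.
df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [4.0, np.nan, 6.0],
    "col3": ["a", "b", "c"],
    "keep": [np.nan, np.nan, 1.0],
})

# Count NaN values for each column before dropping anything.
nan_counts = df.isnull().sum()

# Drop columns by name; `columns=` is equivalent to passing the list with axis=1.
trimmed = df.drop(columns=["col1", "col2", "col3"])
```

`drop` returns a new frame rather than modifying `df`, which is why the cheat-sheet line reassigns the result.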

#### install docker :

```sh
wget -qO- https://get.docker.io/ | sed -e "s/docker.com/docker.io/g" | sh
```

#### install make :

```sh
apt-get install make
```
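import tornado.httpclient
import tornado.web


class ForwardingRequestHandler(tornado.web.RequestHandler):
    def handle_response(self, response):
        # Anything other than an HTTPError from the upstream fetch is treated as a server fault.
        if response.error and not isinstance(response.error,
                                             tornado.httpclient.HTTPError):
            # `note` is not defined in this excerpt; it is assumed to be a logging helper.
            note("response has error %s", response.error)
            self.set_status(500)
            self.write("Internal server error:\n" +
                       str(response.error))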
@mr1azl
mr1azl / faster_toPandas.py
Created April 13, 2016 22:13 — forked from joshlk/faster_toPandas.py
PySpark faster toPandas using mapPartitions
import pandas as pd


def _map_to_pandas(rdds):
    """ Needs to be here due to pickling issues """
    return [pd.DataFrame(list(rdds))]


def toPandas(df, n_partitions=None):
    """
    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
    repartitioned if `n_partitions` is passed.
    """
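
The preview stops before the function body. Based only on the gist title and the docstring, the approach could be sketched as follows. The name `to_pandas_sketch` is hypothetical and this is not necessarily the author's code; `repartition`, `rdd`, `mapPartitions`, `collect`, and `columns` are standard PySpark DataFrame and RDD members, and `df` is assumed to be a `pyspark.sql.DataFrame`.

```python
import pandas as pd


def _map_to_pandas(rows):
    """Turn one partition (an iterator of rows) into a one-element list holding a DataFrame."""
    return [pd.DataFrame(list(rows))]


def to_pandas_sketch(df, n_partitions=None):
    """Hypothetical completion: build one pandas frame per partition, then concatenate."""
    if n_partitions is not None:
        df = df.repartition(n_partitions)
    # mapPartitions calls the helper once per partition on the executors;
    # collect() brings the per-partition frames back to the driver.
    frames = df.rdd.mapPartitions(_map_to_pandas).collect()
    result = pd.concat(frames, ignore_index=True)
    # Rows arrive as plain tuples, so the column names are restored from the Spark schema.
    result.columns = df.columns
    return result
```

The helper lives at module level because Spark has to pickle it to ship it to the executors, which matches the note in the original docstring.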

An introduction to distributed systems

Copyright 2014, 2016 Kyle Kingsbury

This outline accompanies a 12-16 hour overview class on distributed systems fundamentals. The course aims to introduce software engineers to the practical basics of distributed systems, through lecture and discussion. Participants will gain an intuitive understanding of key distributed systems terms and an overview of the algorithmic landscape, and will explore production concerns.

For the network configuration, you only need to declare the nodes' FQDNs (for example nodeX.filrouge.com) in /etc/hosts and /etc/sysconfig/network (on CentOS or Red Hat).
The catch with this offer is that it provides only one public address, so the other nodes have no internet access. You will therefore need to install a proxy on the machine connected to the internet and configure the remaining nodes to go through it.
Here are tutorials for installing the proxy:
http://www.krizna.com/centos/how-to-install-squid-proxy-on-centos-6/
http://www.cyberciti.biz/tips/linux-unix-squid-proxy-server-authentication.html
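
The two configuration steps described above (FQDN entries in /etc/hosts, then routing the other nodes through the proxy) can be sketched as small helpers that only build the text to paste. Everything here is an assumption for illustration: the function names, the private IP addresses, and the use of port 3128, which is Squid's default listening port.

```python
def hosts_lines(node_ips, domain="filrouge.com"):
    """Return /etc/hosts lines mapping each IP to nodeN.<domain> and its short name."""
    lines = []
    for index, ip in enumerate(node_ips, start=1):
        short = f"node{index}"
        lines.append(f"{ip}\t{short}.{domain}\t{short}")
    return lines


def proxy_environment(proxy_host, port=3128):
    """Return proxy variables for nodes that have no direct internet access."""
    url = f"http://{proxy_host}:{port}"
    return {"http_proxy": url, "https_proxy": url}


# Hypothetical addresses; replace them with the cluster's real private IPs.
example_hosts = hosts_lines(["10.0.0.11", "10.0.0.12"])
example_env = proxy_environment("node1.filrouge.com")
```

The helpers write nothing to disk, so the output can be reviewed before it is copied to each node.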
mr1azl / tweet_dumper.py
Created January 18, 2016 14:01 — forked from yanofsky/LICENSE
A script to download all of a user's tweets into a csv
#!/usr/bin/env python
# encoding: utf-8
import tweepy #https://github.com/tweepy/tweepy
import csv
#Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
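
The preview ends before any tweets are fetched or written. The CSV half of such a script can be sketched on its own with the standard library. The function name `write_tweets_csv` and the three-column layout (`id`, `created_at`, `text`) are assumptions, and the fetching step is left out because it depends on the tweepy version and on valid credentials.

```python
import csv


def write_tweets_csv(rows, path):
    """Write (id, created_at, text) tuples to a CSV file with a header row."""
    # newline="" lets the csv module control line endings, so tweets that
    # contain line breaks stay inside a single quoted field.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows(rows)
```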
mr1azl / comparison.py
Created December 29, 2015 09:59 — forked from patrickfuller/comparison.py
Compares tornado.auth.GoogleMixin with tornado.auth.GoogleOAuth2Mixin. The latter is required after Google's OAuth updates.
"""
A webserver to test Google OAuth in a couple of scenarios.
"""
import argparse
import time
import tornado.ioloop
import tornado.web
import tornado.auth
import tornado.gen
mr1azl / mysql_to_big_query.sh
Created December 24, 2015 15:19 — forked from shantanuo/mysql_to_big_query.sh
Copy a MySQL table to BigQuery. To copy all tables, use the loop given at the end. The script exits with error code 3 if blob or text columns are found. The CSV files are first copied to Google Cloud before being imported into BigQuery.
#!/bin/sh
TABLE_SCHEMA=$1
TABLE_NAME=$2
mytime=`date '+%y%m%d%H%M'`
hostname=`hostname | tr 'A-Z' 'a-z'`
file_prefix="trimax$TABLE_NAME$mytime$TABLE_SCHEMA"
bucket_name=$file_prefix
splitat="4000000000"
bulkfiles=200
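
The description mentions exiting with code 3 when blob or text columns are found, but the preview stops before that check. A minimal Python sketch of the rule follows. It assumes that "blob or text" covers every MySQL BLOB and TEXT variant and that the column list arrives as `(name, data_type)` pairs, for instance from a query on `information_schema.columns`. The function names are hypothetical, and this is not the script's own implementation.

```python
import sys

# MySQL's BLOB and TEXT families; assumed to be what the description means.
UNSUPPORTED_TYPES = {
    "tinyblob", "blob", "mediumblob", "longblob",
    "tinytext", "text", "mediumtext", "longtext",
}


def unsupported_columns(columns):
    """Return the names of columns whose data type is a BLOB or TEXT variant."""
    return [name for name, data_type in columns
            if data_type.lower() in UNSUPPORTED_TYPES]


def check_or_exit(columns):
    """Exit with status 3 if any column cannot be exported, otherwise return None."""
    bad = unsupported_columns(columns)
    if bad:
        print("blob or text columns found: " + ", ".join(bad), file=sys.stderr)
        sys.exit(3)
```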