@hivefans
hivefans / datetime_timestamp.py
Last active March 17, 2020 02:03
|-|{"files":{"datetime_timestamp.py":{"env":"plain"}},"tag":"bigdata"}
#coding:UTF-8
import time
dt = "2016-05-05 20:28:54"
# convert to a time struct (time.struct_time)
timeArray = time.strptime(dt, "%Y-%m-%d %H:%M:%S")
# convert to a Unix timestamp
timestamp = time.mktime(timeArray)
# convert to a new time format (20160505-20:28:54)
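The preview cuts off before that last step; a minimal sketch of how the remaining conversion (and the reverse direction, timestamp back to a string) could look with the standard time module, building on the names defined above:

# format the parsed struct as 20160505-20:28:54
newTime = time.strftime("%Y%m%d-%H:%M:%S", timeArray)
# reverse direction: Unix timestamp back to the original string format
original = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))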
@hivefans
hivefans / demo.py
Last active March 17, 2020 02:03 — forked from martinburch/demo.py
Python MySQL upsert
#!/usr/bin/env python
# encoding: utf-8
import MySQLdb
from upsert import upsert
db = MySQLdb.connect(host="localhost", user="root", passwd="", db="demo", charset="utf8")
c = db.cursor()
import warnings
warnings.filterwarnings("ignore", "Unknown table.*")
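The upsert helper imported above wraps MySQL's INSERT ... ON DUPLICATE KEY UPDATE; a minimal sketch of the same pattern done directly on the cursor defined above (the people table and its columns are illustrative assumptions, not part of the original gist):

# insert new ids, update name/age for ids that already exist
rows = [(1, "alice", 30), (2, "bob", 25)]
sql = (
    "INSERT INTO people (id, name, age) VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE name = VALUES(name), age = VALUES(age)"
)
c.executemany(sql, rows)
db.commit()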
@hivefans
hivefans / Spark Dataframe Cheat Sheet.py
Last active October 22, 2020 10:27 — forked from crawles/Spark Dataframe Cheat Sheet.py
Cheat sheet for Spark Dataframes (using Python)
# A simple cheat sheet of Spark Dataframe syntax
# Current for Spark 1.6.1
# import statements
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
#creating dataframes
df = sqlContext.createDataFrame([(1, 4), (2, 5), (3, 6)], ["A", "B"]) # from manual data
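A few common operations on that DataFrame, as a sketch in the same Spark 1.6-era style (the column names A and B come from the example above):

df.show()                                       # print the rows
df.printSchema()                                # inspect column types
df.select("A").show()                           # project a single column
df.filter(df["B"] > 4).show()                   # filter rows
df.withColumn("C", df["A"] + df["B"]).show()    # add a derived column
df.groupBy("A").agg({"B": "max"}).show()        # aggregate per key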
@hivefans
hivefans / upsert_table.sql
Last active March 17, 2020 02:03 — forked from bembengarifin/upsert_table.sql
mysql bulk insert, with duplicate key update (upsert), and with conditional data update
/*
references:
- https://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html
- https://stackoverflow.com/questions/32777081/bulk-insert-and-update-in-mysql
- https://thewebfellas.com/blog/conditional-duplicate-key-updates-with-mysql
*/
/* create a new database and use it */
drop database if exists test_upsert;
create database test_upsert;
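The gist goes on to build a bulk upsert whose update is conditional; a minimal sketch of that pattern driven from Python with MySQLdb, as in the earlier gist (the table upsert_table and its columns id, val, updated_at are illustrative assumptions):

import MySQLdb

db = MySQLdb.connect(host="localhost", user="root", passwd="", db="test_upsert", charset="utf8")
c = db.cursor()
# bulk upsert: only overwrite a row when the incoming data is newer
sql = (
    "INSERT INTO upsert_table (id, val, updated_at) VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE "
    "val = IF(VALUES(updated_at) > updated_at, VALUES(val), val), "
    "updated_at = IF(VALUES(updated_at) > updated_at, VALUES(updated_at), updated_at)"
)
c.executemany(sql, [(1, "a", "2016-05-05 20:28:54"), (2, "b", "2016-05-05 20:28:54")])
db.commit()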
@hivefans
hivefans / hbase.rest.scanner.filters.md
Last active March 17, 2020 02:03 — forked from stelcheck/hbase.rest.scanner.filters.md
HBase Stargate REST API Scanner Filter Examples

Stargate Scanner Filter Examples

Introduction

So yeah... there is no documentation for the HBase REST API regarding what a filter should look like...

So I installed Eclipse, got the library, and took some time to find some of the (seemingly) most useful filters you could use. I'm very green at anything regarding HBase, and I hope this will help anyone trying to get started with it.

What I discovered is that, basically, the attributes of the filter object follow the same naming as in the documentation. For this reason, I have made each filter name a clickable link to the HBase class documentation it corresponds to; check the constructor argument names, and you will have your attribute list (more or less).
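As a concrete sketch of what one of these filters looks like on the wire, here is a scanner created from Python. The host localhost:8080, the table name mytable, and the row prefix are illustrative assumptions, as is the base64 encoding of the PrefixFilter value, which matches the usual Stargate setup rather than anything stated in the original notes:

import base64
import requests

# create a scanner on mytable whose filter only passes rows starting with "row-1"
scanner_xml = """<Scanner batch="100">
  <filter>
    {"type": "PrefixFilter", "value": "%s"}
  </filter>
</Scanner>""" % base64.b64encode(b"row-1").decode("ascii")

resp = requests.put(
    "http://localhost:8080/mytable/scanner",
    data=scanner_xml,
    headers={"Content-Type": "text/xml"},
)
scanner_url = resp.headers["Location"]  # Stargate answers with the scanner's URL

# fetch a batch of matching rows as JSON (cell keys and values come back base64-encoded)
rows = requests.get(scanner_url, headers={"Accept": "application/json"}).json()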

@hivefans
hivefans / hbase-rest-examples.sh
Last active March 17, 2020 02:03 — forked from karmi/hbase-rest-examples.sh
Experiments with the HBase REST API
#!/usr/bin/env bash
#
# ===================================
# Experiments with the HBase REST API
# ===================================
#
# <http://hbase.apache.org/docs/r0.20.4/api/org/apache/hadoop/hbase/rest/package-summary.html>
#
# Usage:
#
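The kind of call the shell script experiments with can also be reproduced from Python; a minimal sketch assuming a REST gateway on localhost:8080 and an existing table and row key (the names mytable and row-1 are illustrative):

import base64
import requests

# read one row as JSON; column names and values come back base64-encoded
resp = requests.get(
    "http://localhost:8080/mytable/row-1",
    headers={"Accept": "application/json"},
)
for cell in resp.json()["Row"][0]["Cell"]:
    column = base64.b64decode(cell["column"]).decode()
    value = base64.b64decode(cell["$"]).decode()
    print(column, value)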
@hivefans
hivefans / spark_gpkey_comkey
Last active March 17, 2020 02:03
spark.groupByKey, combineByKey
It's best not to use groupByKey on a pairRdd: the groupBy-style functions shuffle data across the cluster and cause performance problems, so a pairRdd usually uses combineByKey instead.
Example:
RDD type before the call: JavaPairRDD<String, HotsCompare>
pairRdd2 = pairRdd.combineByKey(e -> {
    // createCombiner: the first value seen for a key starts a new list
    ArrayList<HotsCompare> list = new ArrayList<HotsCompare>();
    list.add(e);
    return list;
}, (list, e) -> {
    // mergeValue: add a value to the partition-local list
    list.add(e);
    return list;
}, (list1, list2) -> {
    // mergeCombiners: merge lists from different partitions
    list1.addAll(list2);
    return list1;
});
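The same pattern in PySpark, as a minimal sketch (the SparkContext sc and the sample data are illustrative; in practice reduceByKey or aggregateByKey is often preferred when the combiner is cheaper than a list):

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
combined = pairs.combineByKey(
    lambda v: [v],                    # createCombiner: first value for a key
    lambda acc, v: acc + [v],         # mergeValue: add a value within a partition
    lambda acc1, acc2: acc1 + acc2,   # mergeCombiners: merge partition results
)
print(combined.collect())             # e.g. [('a', [1, 2]), ('b', [3])]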
@hivefans
hivefans / pyrdd_access_javardd.md
Last active March 17, 2020 02:03 — forked from yu-iskw/testing.md
PySpark serializer and deserializer testing with a nested and complicated value

Python =(parallelize)=> RDD =(collect)=> Python

It works well.

>>> sc = SparkContext('local', 'test', batchSize=2)
>>> data = [([1, 0], [0.5, 0.499]), ([0, 1], [0.5, 0.499])]
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
[([1, 0], [0.5, 0.499]), ([0, 1], [0.5, 0.499])]
@hivefans
hivefans / watch_log.py
Last active March 17, 2020 02:03 — forked from albsen/watch_log.py
Python log file watcher
#!/usr/bin/env python
"""
Real time log files watcher supporting log rotation.
Author: Giampaolo Rodola' <g.rodola [AT] gmail [DOT] com>
License: MIT
"""
import os
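The full gist builds a callback-driven watcher class; a minimal sketch of the core idea (tail a file and detect rotation by comparing inodes) under the assumption of a single log file on a Unix-like system:

import os
import time

def follow(path, interval=1.0):
    """Yield new lines appended to path, reopening it when the file is rotated."""
    f = open(path)
    f.seek(0, os.SEEK_END)
    while True:
        line = f.readline()
        if line:
            yield line
            continue
        try:
            # a different inode at the same path means the log was rotated
            rotated = os.stat(path).st_ino != os.fstat(f.fileno()).st_ino
        except OSError:
            rotated = False  # file temporarily missing mid-rotation
        if rotated:
            f.close()
            f = open(path)   # reopen the freshly created file
        time.sleep(interval)

# usage: for line in follow("/var/log/syslog"): process(line)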
@hivefans
hivefans / NginxLineParser.scala
Last active March 17, 2020 02:03
|-|{"files":{"NginxLineParser.scala":{"env":"plain"},"build.sbt":{"env":"plain"},"NginxLogRecord.scala":{"env":"plain"},"nginx.log":{"env":"plain"},"WordCount.scala":{"env":"plain"}},"tag":"Uncategorized"}
package spark.example
/**
* Created by shidongjie on 2016/12/4.
*/
class NginxLineParser extends Serializable {
private val regex = "([^-]*)\\s+-\\s+(\\S+)\\s+\\[(\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+-\\d{4})\\]\\s+\"(.+)\"\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\"\\s+Agent\\[\"([^\"]+)\"\\]\\s+(-|\\d.\\d{3,})\\s+(\\S+)\\s+(\\d{1,}).*".r
/**
* @param record Assumed to be an Nginx access log.