BigAN / return bigram python
Created December 19, 2014 10:19
Return bigrams in Python with a regex positive lookahead
You could do this with a positive lookahead:
>>> import re
>>> s = "My name is really nice. This is so awesome."
>>> m = re.findall(r'(?=(\b\w+\b \S+))', s)
>>> m
['My name', 'name is', 'is really', 'really nice.', 'This is', 'is so', 'so awesome.']
BigAN / save
Created December 26, 2014 02:37
# -*- coding: utf-8 -*-
# Python 2 script: imports and MongoDB connection setup.
from itertools import izip, islice
from math import sqrt
from HTMLParser import HTMLParser
from bson import ObjectId

import MongoDBConn  # local helper module wrapping the MongoDB client

dbconn = MongoDBConn.DBConn()
dbconn.connect()
BigAN / zshrc
Created May 23, 2016 03:44
zshrc configuration
# Custom plugins may be added to ~/.oh-my-zsh/custom/plugins/
# Example format: plugins=(rails git textmate ruby lighthouse)
# Add wisely, as too many plugins slow down shell startup.
plugins=(git autojump)
# User configuration
export PATH="/Users/dongjian/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
# export MANPATH="/usr/local/man:$MANPATH"
def f7(seq):
    """Remove duplicates from seq while preserving order."""
    seen = set()
    seen_add = seen.add  # bind the method once to avoid repeated attribute lookups
    return [x for x in seq if not (x in seen or seen_add(x))]
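A quick usage check (the input list is arbitrary, for illustration):

>>> f7([3, 1, 3, 2, 1])
[3, 1, 2]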
BigAN / spark rdd to dataframe to temptable
Last active May 31, 2016 11:50
Convert a Spark RDD to a DataFrame and register it as a temp table
# rs is an existing RDD; each element becomes a row in the single white_value column.
sqlContext.createDataFrame(rs, ["white_value"]).registerTempTable("stray_user_white_list_af_expand")
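A minimal way to exercise this, assuming a Spark 1.x sqlContext and a hypothetical rs built from single-element tuples:

rs = sc.parallelize([("user_1",), ("user_2",)])  # hypothetical whitelist entries
sqlContext.createDataFrame(rs, ["white_value"]).registerTempTable("stray_user_white_list_af_expand")
sqlContext.sql("SELECT white_value FROM stray_user_white_list_af_expand").show()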
BigAN / 1
Created June 1, 2018 02:11
git: append files above a size threshold to .gitignore
# Append every file larger than 500 KB (excluding notebooks) to .gitignore.
find . -size +500k | grep -v ipynb >> .gitignore
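One caveat: find prefixes each path with ./, which .gitignore patterns do not match; a sed pass can strip it (a sketch, same assumptions as above):

find . -size +500k | grep -v ipynb | sed 's|^\./||' >> .gitignore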
BigAN / time.py
Last active October 10, 2018 01:19
Flexible Lead/Lag Feature Generation
"""
This script provides reusable code for generating lead/lag time
delta features (using epoch time) for an arbitrary choice of lead/lag orders.
You can use this to generate useful visit time delta features for
this competition,and it should be fairly straightforward to
apply the functions to other datasets as well. Feel free to just
take the output from this kernel as features, they'll match the original
order of train and test. I hope it's helpful!
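The gist's function bodies are not captured here; a minimal sketch of the idea, assuming a pandas DataFrame with hypothetical user_id and visit_ts (epoch seconds) columns:

import pandas as pd

def add_lead_lag_deltas(df, group_col="user_id", ts_col="visit_ts", orders=(1, 2)):
    """Add time deltas to the k-th previous (lag) and next (lead) visit per group."""
    df = df.sort_values([group_col, ts_col]).copy()
    grouped = df.groupby(group_col)[ts_col]
    for k in orders:
        df["lag_%d_delta" % k] = df[ts_col] - grouped.shift(k)    # seconds since k-th previous visit
        df["lead_%d_delta" % k] = grouped.shift(-k) - df[ts_col]  # seconds until k-th next visit
    return df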
You could split the string on the pipe character and explode the result with Spark's built-in functions:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("a1", "b1", "c1|c2|c3|c4")).toDF("A", "B", "C")
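The capture ends here; a plausible continuation (the pipe must be escaped, since split takes a regex pattern):

val exploded = df.withColumn("C", explode(split($"C", "\\|")))
exploded.show()

This yields one row per c-value, with the A and B columns repeated on each row.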