BigAN / return bigram python
Created December 19, 2014 10:19
Return bigrams in Python with a regex positive lookahead
You could do this with a positive lookahead:
>>> import re
>>> s = "My name is really nice. This is so awesome."
>>> m = re.findall(r'(?=(\b\w+\b \S+))', s)
>>> m
['My name', 'name is', 'is really', 'really nice.', 'This is', 'is so', 'so awesome.']
BigAN / save
Created December 26, 2014 02:37
# -*- coding: utf-8 -*-
# Python 2 script: imports and MongoDB connection setup.
from itertools import izip, islice
from math import sqrt
from HTMLParser import HTMLParser
from bson import ObjectId

import MongoDBConn  # local helper module wrapping the MongoDB client

dbconn = MongoDBConn.DBConn()
dbconn.connect()
BigAN / zshrc
Created May 23, 2016 03:44
zshrc configuration
# Custom plugins may be added to ~/.oh-my-zsh/custom/plugins/
# Example format: plugins=(rails git textmate ruby lighthouse)
# Add wisely, as too many plugins slow down shell startup.
plugins=(git autojump)
# User configuration
export PATH="/Users/dongjian/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
# export MANPATH="/usr/local/man:$MANPATH"
def f7(seq):
    """Remove duplicates from seq while preserving order."""
    seen = set()
    seen_add = seen.add  # bind the method once to avoid repeated attribute lookups
    return [x for x in seq if not (x in seen or seen_add(x))]
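A quick usage check (the input list is arbitrary, for illustration):

>>> f7([3, 1, 3, 2, 1])
[3, 1, 2]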
BigAN / spark rdd to dataframe to temptable
Last active May 31, 2016 11:50
Convert a Spark RDD to a DataFrame and register it as a temp table
# rs is an existing RDD; each element becomes a row in the single white_value column.
sqlContext.createDataFrame(rs, ["white_value"]).registerTempTable("stray_user_white_list_af_expand")
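A minimal way to exercise this, assuming a Spark 1.x sqlContext and a hypothetical rs built from single-element tuples:

rs = sc.parallelize([("user_1",), ("user_2",)])  # hypothetical whitelist entries
sqlContext.createDataFrame(rs, ["white_value"]).registerTempTable("stray_user_white_list_af_expand")
sqlContext.sql("SELECT white_value FROM stray_user_white_list_af_expand").show()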
BigAN / 1
Created June 1, 2018 02:11
git: append files above a size threshold to .gitignore
# Append every file larger than 500 KB (excluding notebooks) to .gitignore.
find . -size +500k | grep -v ipynb >> .gitignore
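One caveat: find prefixes each path with ./, which .gitignore patterns do not match; a sed pass can strip it (a sketch, same assumptions as above):

find . -size +500k | grep -v ipynb | sed 's|^\./||' >> .gitignore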
BigAN / time.py
Last active October 10, 2018 01:19
Flexible Lead/Lag Feature Generation
"""
This script provides reusable code for generating lead/lag time
delta features (using epoch time) for an arbitrary choice of lead/lag orders.
You can use this to generate useful visit time delta features for
this competition,and it should be fairly straightforward to
apply the functions to other datasets as well. Feel free to just
take the output from this kernel as features, they'll match the original
order of train and test. I hope it's helpful!
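The gist's function bodies are not captured here; a minimal sketch of the idea, assuming a pandas DataFrame with hypothetical user_id and visit_ts (epoch seconds) columns:

import pandas as pd

def add_lead_lag_deltas(df, group_col="user_id", ts_col="visit_ts", orders=(1, 2)):
    """Add time deltas to the k-th previous (lag) and next (lead) visit per group."""
    df = df.sort_values([group_col, ts_col]).copy()
    grouped = df.groupby(group_col)[ts_col]
    for k in orders:
        df["lag_%d_delta" % k] = df[ts_col] - grouped.shift(k)    # seconds since k-th previous visit
        df["lead_%d_delta" % k] = grouped.shift(-k) - df[ts_col]  # seconds until k-th next visit
    return df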
You could split the string on the pipe character and explode the result with Spark's built-in functions:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("a1", "b1", "c1|c2|c3|c4")).toDF("A", "B", "C")
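The capture ends here; a plausible continuation (the pipe must be escaped, since split takes a regex pattern):

val exploded = df.withColumn("C", explode(split($"C", "\\|")))
exploded.show()

This yields one row per c-value, with the A and B columns repeated on each row.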