You could do this with a positive lookahead:
>>> import re
>>> s = "My name is really nice. This is so awesome."
>>> m = re.findall(r'(?=(\b\w+\b \S+))', s)
>>> m
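As a self-contained check of the lookahead idea, here is a minimal sketch that wraps the same pattern in a small helper. The helper name overlapping_pairs is hypothetical and not part of the original answer. Because the capture group sits inside a zero-width lookahead, each match consumes no characters, so findall can report word pairs that overlap.

```python
import re

# Same pattern as in the session above. The lookahead matches without
# consuming input, so the next search can start one character later and
# find a pair that overlaps the previous one.
PAIR = re.compile(r'(?=(\b\w+\b \S+))')


def overlapping_pairs(text):
    """Return each whole word followed by one space and a run of non-space characters."""
    return PAIR.findall(text)


pairs = overlapping_pairs("My name is really nice. This is so awesome.")
for pair in pairs:
    print(pair)
```

Note that a word followed directly by punctuation ("nice." or "awesome.") can end a pair but cannot start one, because the pattern requires a space immediately after the first word.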
# -*- coding: utf-8 -*-
# Python 2 imports: izip and the HTMLParser module were removed or renamed in Python 3.
from itertools import islice, izip
from math import sqrt
from HTMLParser import HTMLParser
import MongoDBConn
from bson import ObjectId
# Custom plugins may be added to ~/.oh-my-zsh/custom/plugins/
# Example format: plugins=(rails git textmate ruby lighthouse)
# Add wisely, as too many plugins slow down shell startup.
plugins=(git autojump)
# User configuration
export PATH="/Users/dongjian/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
# export MANPATH="/usr/local/man:$MANPATH"
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
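The trick in f7 is easy to miss, so here is a commented sketch of the same function with a usage example. The name unique_in_order is only an illustrative rename. It assumes the items are hashable, since they are stored in a set.

```python
def unique_in_order(seq):
    """Drop repeated items while keeping the first occurrence of each."""
    seen = set()
    seen_add = seen.add  # bind the method once to avoid repeated attribute lookups
    # set.add returns None (falsy), so the `or` branch only runs for an item
    # that is not yet in `seen`; it records the item and lets it through.
    return [x for x in seq if not (x in seen or seen_add(x))]


result = unique_in_order([3, 1, 3, 2, 1, 5])
print(result)
```

Unlike list(set(seq)), this keeps the original order of first appearance.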
Spark: convert an RDD to a DataFrame, then register it as a temp table
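Git: ignore files above a size threshold (500k here; use +1M for 1 MB), skipping notebooks.
The sed step strips the leading "./" that find prints, since .gitignore patterns starting with "./" do not match.
find . -type f -size +500k | grep -v ipynb | sed 's|^\./||' >> .gitignore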
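For cases where the shell pipeline is not available, the same idea can be sketched in Python. This is an illustration under assumptions, not an equivalent of find: the function name large_files and its parameters are hypothetical, the size test is a plain byte comparison, and the .git directory is skipped explicitly.

```python
import os


def large_files(root, min_bytes, skip_substring="ipynb"):
    """Return sorted, root-relative paths of files under `root` larger than `min_bytes`."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            if skip_substring in rel:
                continue  # mirrors the `grep -v ipynb` step
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # e.g. a broken symlink
            if size > min_bytes:
                found.append(rel)
    return sorted(found)
```

The returned paths carry no leading "./", so they can be appended to .gitignore one per line, for example with large_files(".", 500 * 1024).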
This script provides reusable code for generating lead/lag time-delta
features (from epoch times) for an arbitrary choice of lead/lag orders.
You can use it to generate visit time-delta features for
this competition, and it should be fairly straightforward to
apply the functions to other datasets as well. Feel free to take
the output of this kernel as features; the rows match the original
order of train and test. I hope it's helpful!
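To make "lead/lag time delta" concrete, here is a minimal pure-Python sketch. It is not the kernel's code, which is not shown here. It assumes each row is a (key, epoch_seconds) pair, that deltas are computed among rows sharing the same key, and that a missing neighbour is reported as None. The function name lead_lag_deltas and the feature names lag_k / lead_k are hypothetical.

```python
from collections import defaultdict


def lead_lag_deltas(rows, orders=(1, 2)):
    """Return one feature dict per row, in the original row order.

    lag_k  = seconds since the k-th previous event with the same key
    lead_k = seconds until the k-th next event with the same key
    """
    # Collect the original row positions for each key.
    by_key = defaultdict(list)
    for idx, (key, _ts) in enumerate(rows):
        by_key[key].append(idx)

    features = [dict() for _ in rows]
    for indices in by_key.values():
        # Order each key's rows chronologically (stable for equal timestamps).
        ordered = sorted(indices, key=lambda i: rows[i][1])
        for pos, idx in enumerate(ordered):
            ts = rows[idx][1]
            for k in orders:
                prev_pos, next_pos = pos - k, pos + k
                features[idx]["lag_%d" % k] = (
                    ts - rows[ordered[prev_pos]][1] if prev_pos >= 0 else None
                )
                features[idx]["lead_%d" % k] = (
                    rows[ordered[next_pos]][1] - ts
                    if next_pos < len(ordered) else None
                )
    return features


visits = [("u1", 100), ("u2", 50), ("u1", 160), ("u1", 130)]
deltas = lead_lag_deltas(visits, orders=(1,))
for row, feats in zip(visits, deltas):
    print(row, feats)
```

Writing the features back by original index is what keeps the output aligned with the input order, even though the deltas are computed on time-sorted rows.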
Here is what you could do: split the string on the pipe character and explode the resulting array with Spark's explode function.
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("a1", "b1", "c1|c2|c3|c4")).toDF("A", "B", "C")