wut0n9

## pd_apply_example.py
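# Assign to multiple columns at once with apply

import pandas as pd


def func(col1):
    # Return the same value for both col2 and col3
    return col1, col1


df = pd.read_excel('test.xlsx')
# Wrap the returned tuple in a Series so each element becomes one column
df[['col2', 'col3']] = df.apply(lambda row: pd.Series(func(row.iloc[0])), axis=1)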
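
A variant that runs without `test.xlsx` may make the pattern easier to try out. This is only a sketch: the in-memory frame, the column name `col1`, and the doubled second value are assumptions made for illustration, and `result_type='expand'` is used in place of wrapping the tuple in a Series.

```python
import pandas as pd


def func(col1):
    # Hypothetical helper: returns two values, one per target column
    return col1, col1 * 2


# Small in-memory frame standing in for the Excel file
df = pd.DataFrame({'col1': [1, 2, 3]})

# result_type='expand' turns each returned tuple into separate columns
df[['col2', 'col3']] = df.apply(
    lambda row: func(row['col1']), axis=1, result_type='expand'
)
```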

## jupyter_start.sh

LANG=zn
jupyter notebook

## ipython_matplotlib_img.py
# Render inline figures as vector graphics (SVG) in IPython
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

## grad_norm_exmple.py
# Gradient clipping
# https://github.com/tensorflow/models/blob/56cbd1f2770f1e7386db43af37f6f11b4d85e3da/tutorials/rnn/ptb/ptb_word_lm.py#L159-L165
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(self._cost, tvars),
                                  config.max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(self._lr)
self._train_op = optimizer.apply_gradients(
    zip(grads, tvars),
    global_step=tf.train.get_or_create_global_step())
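
What `tf.clip_by_global_norm` computes can be sketched in plain NumPy. This is an illustration of the formula rather than the TensorFlow implementation; the function below and the sample gradients are assumptions.

```python
import numpy as np


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one shared factor so their joint L2 norm is at most clip_norm."""
    # Global norm: L2 norm over every element of every gradient tensor
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The factor is 1 when the norm is already within the limit
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


# Sample gradients (made up): the global norm is sqrt(3^2 + 4^2) = 5
grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Because every tensor is scaled by the same factor, the direction of the overall gradient is preserved, which is the usual reason to prefer this over clipping each tensor separately.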

## compute_idf_df_example.py
"""Calculates frequencies of terms in documents and in corpus. Also computes inverse document frequencies."""
for document in self.corpus:
    frequencies = {}
    self.doc_len.append(len(document))
    for word in document:
        if word not in frequencies:
            frequencies[word] = 0
        frequencies[word] += 1
    self.f.append(frequencies)
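
The loop above only collects per-document term frequencies. The document-frequency and IDF step that the docstring mentions could look roughly like this self-contained sketch. The toy corpus and the plain `log(N / df)` formula are assumptions; the original class may use a different IDF variant.

```python
import math
from collections import Counter

# Toy corpus (made up): each document is a list of tokens
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# Term frequencies per document, mirroring self.f in the snippet above
term_freqs = [Counter(document) for document in corpus]

# Document frequency: number of documents that contain each term
doc_freq = Counter()
for freqs in term_freqs:
    doc_freq.update(freqs.keys())

# Inverse document frequency, assuming the plain log(N / df) form
n_docs = len(corpus)
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
```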

## word2vec_comparision.md

      
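Last active January 10, 2019 01:26

Comparison of the strengths and weaknesses of various word2vec algorithms

Neither Word2vec nor GloVe handles word inflection well. For example, the English words study, studies, and studied all express the same meaning, yet both methods treat these different forms as unrelated words, which leads to redundant or lost information.
To address this, Facebook's research lab proposed the FastText model, whose goal is to model word inflection.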

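The following compares Word2vec, GloVe, and FastText:

In terms of results on concrete NLP tasks: CBOW performs worst, followed by Skip-gram (GloVe is roughly on par with Skip-gram), and FastText performs best.
FastText handles the out-of-vocabulary (OOV) problem well and models word inflection well, so it is very effective for languages with rich inflection such as German and Spanish.
In terms of efficiency (training time): CBOW is the most efficient, followed by GloVe, then Skip-gram, with FastText the least efficient.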


## IR_based_QA.md

      
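Last active January 27, 2019 06:10

Design of retrieval-based chatbots and community question answering

The architecture of a retrieval-based chatbot is divided into an offline part and an online part:

Offline part: first, prepare an Index that stores a large volume of human-machine interaction data. Next, each Matching Model evaluates the similarity between an input and an output and produces a similarity score that indicates whether the output is a suitable response to the input; the more Matching Models there are, the more similarity scores are produced. These scores are then passed through a classifier (with a preset threshold, above which an output is considered a selectable candidate) and combined into a final score list. A ranking model sorts the scores, and the top-ranked output is selected as the reply to the current input.

Online part: with the offline preparation in place, the current context is used to query the Index of human-machine interaction data and retrieve a set of candidate replies. Each candidate reply, together with the current context, is then scored by the Matching Models; each score serves as a Feature, so every pair of candidate reply and context yields a Feature Vector. These Feature Vectors are turned into final scores by the ranking model or the classifier. Finally, the ranking model sorts the scores and a suitable reply is chosen from the sorted list.

Retrieval-based chatbots borrow heavily from search-engine techniques such as learning to rank. What is new is that, given a context and a candidate reply, a Matching Model is built to measure whether the candidate can serve as a reply to that context. In essence, a retrieval-based chatbot reuses existing human replies to answer new human input.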
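
The online flow described in this section can be sketched in a few dozen lines. Everything concrete here is an assumption for illustration: the three-entry index, the two hand-written matchers (which compare the query with the stored context rather than with the reply), the fixed weights standing in for a trained ranker, and the threshold value.

```python
def tokenize(text):
    return set(text.lower().split())


# Toy Index of (context, reply) pairs; a real system would use a search engine
INDEX = [
    ("how is the weather today", "It is sunny and warm."),
    ("what is your name", "I am a retrieval bot."),
    ("recommend a good movie", "You could try a classic comedy."),
]


def retrieve(query, k=2):
    """Return up to k indexed pairs that share at least one word with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(ctx)), ctx, reply) for ctx, reply in INDEX]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(ctx, reply) for _, ctx, reply in scored[:k]]


def jaccard(query, ctx):
    a, b = tokenize(query), tokenize(ctx)
    return len(a & b) / len(a | b)


def length_ratio(query, ctx):
    return min(len(query), len(ctx)) / max(len(query), len(ctx))


MATCHERS = [jaccard, length_ratio]  # each matcher contributes one feature
WEIGHTS = [0.8, 0.2]                # hypothetical stand-in for a trained ranker
THRESHOLD = 0.3                     # candidates scoring below this are dropped


def respond(query):
    ranked = []
    for ctx, reply in retrieve(query):
        features = [matcher(query, ctx) for matcher in MATCHERS]
        score = sum(w * f for w, f in zip(WEIGHTS, features))
        if score >= THRESHOLD:
            ranked.append((score, reply))
    ranked.sort(reverse=True)
    return ranked[0][1] if ranked else None


print(respond("what is the weather today"))
```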

  
## text_correct.md

      
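Last active March 11, 2019 10:51

Text correction #query_rewrite

"Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape" [1, Yu, 2013]
The paper presents a correction method with relatively high precision and relatively low recall.

1. Build a character-level n-gram language model.
2. Recall candidates by pronunciation (pinyin) and by character shape.
3. Use a dictionary to filter out some invalid candidates.
4. Take the candidate with the highest language model score.
5. If the score exceeds a preset threshold, treat it as the replacement candidate.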
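
The steps above can be sketched end to end on toy data. This is not the paper's implementation: the three-sentence corpus, the confusion set, the dictionary, the add-one-smoothed bigram model, and the reading of the threshold as a minimum score gain over the original sentence are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy data: every value here is made up for illustration
CORPUS = ["我们今天去公园", "今天天气很好", "我们去公园玩"]
CONFUSION = {"圆": ["园", "元"], "园": ["圆", "元"]}  # similar sound or shape
DICTIONARY = {"我们", "今天", "天气", "公园"}

# Character bigram counts with sentence boundary marks
context_counts = Counter()
bigram_counts = Counter()
for sentence in CORPUS:
    chars = ["<s>"] + list(sentence) + ["</s>"]
    context_counts.update(chars[:-1])
    bigram_counts.update(zip(chars, chars[1:]))
VOCAB_SIZE = len({c for s in CORPUS for c in s}) + 1  # +1 for </s>


def lm_score(sentence):
    """Log-probability under an add-one-smoothed character bigram model."""
    chars = ["<s>"] + list(sentence) + ["</s>"]
    return sum(
        math.log((bigram_counts[(a, b)] + 1) / (context_counts[a] + VOCAB_SIZE))
        for a, b in zip(chars, chars[1:])
    )


def in_dictionary(sentence, i):
    """Keep a candidate only if position i forms a dictionary word with a neighbour."""
    return sentence[max(i - 1, 0):i + 1] in DICTIONARY or sentence[i:i + 2] in DICTIONARY


def correct(sentence, threshold=1.0):
    base = lm_score(sentence)
    best, best_score = sentence, base
    for i, ch in enumerate(sentence):
        for cand in CONFUSION.get(ch, []):          # recall by sound or shape
            candidate = sentence[:i] + cand + sentence[i + 1:]
            if not in_dictionary(candidate, i):     # dictionary filter
                continue
            score = lm_score(candidate)
            if score > best_score:                  # keep the top-scoring candidate
                best, best_score = candidate, score
    # Accept the replacement only if the gain clears the threshold
    return best if best_score - base > threshold else sentence


print(correct("我们去公圆"))
```

Requiring both a dictionary hit and a score margin is what pushes such a pipeline toward high precision at the cost of recall.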


## rank_strength_weekness.md

      
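Created January 18, 2019 05:37

Pros and cons of various ranking models, and the advantages of DNNs in feature representation

LR can be viewed as a linear network with a single layer and a single node. Its main strength is interpretability. Good interpretability is generally something industrial practice cares about: it means better controllability, and it also guides engineers in analyzing problems and improving the model. However, LR depends on heavy investment in manual feature engineering, and a limited set of feature combinations cannot provide strong expressive power.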


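FM can be seen as LR plus a set of second-order cross terms. Introducing automatic feature crosses reduces the manual feature-engineering effort, adds nonlinearity to the model, and captures more information. FM learns pairwise feature interactions automatically, but it still cannot cover higher-order feature crosses.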
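
The second-order term can be made concrete with a small NumPy sketch. The parameter names (`w0`, `w`, `V`) and the random inputs are assumptions for illustration. The first function spells out the pairwise sum; the second uses the standard algebraic rewrite that costs O(kn) instead of O(kn²).

```python
import numpy as np


def fm_predict_naive(x, w0, w, V):
    """LR part plus an explicit sum over all feature pairs i < j."""
    total = w0 + w @ x
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            total += (V[i] @ V[j]) * x[i] * x[j]
    return total


def fm_predict(x, w0, w, V):
    """Same prediction via 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    linear = w0 + w @ x
    xv = V.T @ x  # shape (k,)
    pairwise = 0.5 * np.sum(xv ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise


rng = np.random.default_rng(0)
n_features, k = 6, 3
x = rng.normal(size=n_features)
w0, w = 0.5, rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))  # one k-dimensional latent vector per feature

y_naive = fm_predict_naive(x, w0, w, V)
y_fast = fm_predict(x, w0, w, V)
```

With `V` set to zeros, both functions reduce to the linear part, which matches the description of FM as LR plus cross terms.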


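GBDT is a boosting model: it combines many weak models, each fitting the residual left by the previous ones, into a strong model. Tree models have a natural advantage in discovering and combining higher-order statistical features, and they are also reasonably interpretable. The main weakness of GBDT is its reliance on continuous statistical features; it does not handle high-dimensional sparse features or time-series features well.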
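
The residual-fitting idea can be illustrated with one-split regression stumps and squared loss. This is a minimal sketch rather than a production GBDT: a single feature, the stump learner, the learning rate, and the toy step-function data are all assumptions.

```python
import numpy as np


def fit_stump(x, residual):
    """Pick the threshold on x that minimises squared error of a two-leaf tree."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the largest value so the right leaf is never empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value


def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)


def fit_gbdt(x, y, n_rounds=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residual."""
    base = y.mean()
    pred = np.full(len(y), base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred  # negative gradient of the squared loss
        stump = fit_stump(x, residual)
        pred += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, pred


# Toy data: a step function of a single continuous feature
x = np.linspace(0.0, 1.0, 20)
y = np.where(x > 0.5, 2.0, 0.0)

base, stumps, pred = fit_gbdt(x, y)
initial_mse = np.mean((y - base) ** 2)
final_mse = np.mean((y - pred) ** 2)
```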


The advantages of deep models show up in the following respects:

  
## generate_batch.py
def generate_batch(batch_size, data_vec, word_to_int):
    n_chunk = len(data_vec) // batch_size
    x_batches = []
    y_batches = []
    for i in range(n_chunk):
        start_index = i * batch_size
        end_index = start_index + batch_size


        batches = data_vec[start_index:end_index]
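
The function above is cut off after slicing out one chunk. One plausible way to finish it is sketched below under loud assumptions: sequences are padded to the longest one in the batch using `word_to_int[' ']` as the pad id, and targets are the inputs shifted left by one position. None of this is confirmed by the original snippet, and the name `generate_batch_sketch` is hypothetical.

```python
import numpy as np


def generate_batch_sketch(batch_size, data_vec, word_to_int):
    pad_id = word_to_int[' ']  # assumption: the space token doubles as padding
    n_chunk = len(data_vec) // batch_size
    x_batches = []
    y_batches = []
    for i in range(n_chunk):
        start_index = i * batch_size
        end_index = start_index + batch_size
        batches = data_vec[start_index:end_index]

        # Pad every sequence in the chunk to the longest one
        length = max(len(seq) for seq in batches)
        x_data = np.full((batch_size, length), pad_id, dtype=np.int32)
        for row, seq in enumerate(batches):
            x_data[row, :len(seq)] = seq

        # Assumed next-token targets: inputs shifted left, padded at the end
        y_data = np.full_like(x_data, pad_id)
        y_data[:, :-1] = x_data[:, 1:]

        x_batches.append(x_data)
        y_batches.append(y_data)
    return x_batches, y_batches


# Made-up sequences of token ids
xs, ys = generate_batch_sketch(2, [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]], {' ': 0})
```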