Jzzhou nlpjoe

@nlpjoe
nlpjoe / main.md
Last active July 23, 2020 12:20
[Fluent Python] #python
@nlpjoe
nlpjoe / main.md
Created July 8, 2020 07:20
[SQL: how to count how many times each value repeats in a column, and sort the results by count in descending order?] #sql
select column_to_count, count(*) from table
group by column_to_count
order by count(*) desc
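The query can be sanity-checked with Python's built-in sqlite3 module; the `tags` table and its values here are hypothetical, just to illustrate the pattern:

```python
import sqlite3

# In-memory database with a hypothetical one-column table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT)")
conn.executemany("INSERT INTO tags VALUES (?)",
                 [("nlp",), ("ml",), ("nlp",), ("nlp",), ("ml",), ("sql",)])

# Count occurrences of each value and sort by count, descending
rows = conn.execute(
    "SELECT name, COUNT(*) FROM tags GROUP BY name ORDER BY COUNT(*) DESC"
).fetchall()
print(rows)  # [('nlp', 3), ('ml', 2), ('sql', 1)]
```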
@nlpjoe
nlpjoe / main.md
Last active July 12, 2020 17:00
[finetune XLM-Roberta with MLM] #nlp
@nlpjoe
nlpjoe / main.md
Last active June 21, 2020 13:13
[Sharing session] #ml

[1] June 23: competition ends

[2] Pretrained model framework

[3] Data augmentation, class balancing

[4] Tool sharing - GitHub, Gist

[option] Base models

@nlpjoe
nlpjoe / main.md
Created June 21, 2020 12:13
[Sharing session] #ml

Data augmentation, class balancing

Pretrained models: transformers

@nlpjoe
nlpjoe / RocAucMeter.md
Last active August 26, 2020 08:28
[AUC, cosine metrics, F1 score] #pytorch #ml

Macro F1: split the n-class evaluation into n binary (one-vs-rest) evaluations, compute an F1 score for each, and take the mean of the n F1 scores; that mean is the Macro F1.

Micro F1: split the n-class evaluation into n binary evaluations, sum the TP, FP, and FN counts across all n of them, compute precision and recall from the pooled counts, and the F1 score derived from that precision and recall is the Micro F1.

In general, a higher Macro F1 or Micro F1 means better classification. Macro F1 is heavily influenced by classes with few samples.

Basic elements:

(1) If an instance is positive and is predicted positive, it is a true positive (TP).

(2) If an instance is positive but is predicted negative, it is a false negative (FN).
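The two definitions above can be checked with a small sketch; `macro_micro_f1` is a hypothetical helper written for illustration, not code from the gist:

```python
from collections import Counter

def macro_micro_f1(y_true, y_pred):
    """Macro/Micro F1 via n one-vs-rest binary evaluations."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    # Macro F1: per-class F1, then unweighted mean
    f1s = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(f1s) / len(labels)
    # Micro F1: pool TP/FP/FN across classes, then compute one F1
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_prec = TP / (TP + FP)
    micro_rec = TP / (TP + FN)
    micro = 2 * micro_prec * micro_rec / (micro_prec + micro_rec)
    return macro, micro
```

Note that for single-label multiclass data, pooled FP and FN are both just the number of misclassifications, so Micro F1 reduces to accuracy.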

@nlpjoe
nlpjoe / BalanceClassSampler.md
Last active June 29, 2020 09:45
[class balance] #pytorch #ml
from typing import List

from torch.utils.data.sampler import Sampler

class BalanceClassSampler(Sampler):
    """Abstraction over data sampler.
    Allows you to create stratified sample on unbalanced classes.
    """

    def __init__(self, labels: List[int], mode: str = "downsampling"):
        """
@nlpjoe
nlpjoe / [Tool collection] #pytorch
Last active September 17, 2020 09:12
[utils] #ml #pytorch
OSS connection tool (IO)
Read ODPS tables
PyTorch train/test Dataset that reads ODPS tables
@nlpjoe
nlpjoe / generate_synthetic.md
Last active August 27, 2020 08:40
[NLP albumentations, data augmentation] #pytorch #ml

Excellent article: https://neptune.ai/blog/data-augmentation-nlp

import pandas as pd

class SynthesicOpenSubtitlesTransform(NLPTransform):
    # NLPTransform, ROOT_PATH and clean_text are defined elsewhere in the gist
    def __init__(self, always_apply=False, p=0.5):
        super(SynthesicOpenSubtitlesTransform, self).__init__(always_apply, p)
        df = pd.read_csv(f'{ROOT_PATH}/input/open-subtitles-toxic-pseudo-labeling/open-subtitles-synthesic.csv', index_col='id')[['comment_text', 'toxic', 'lang']]
        df = df[~df['comment_text'].isna()]  # drop rows with missing text
        # clean_text runs per row; parallel_apply comes from pandarallel
        df['comment_text'] = df.parallel_apply(lambda x: clean_text(x['comment_text'], x['lang']), axis=1)
        df = df.drop_duplicates(subset='comment_text')