Jzzhou nlpjoe

@nlpjoe
nlpjoe / main.md
Last active July 23, 2020 12:20
[Fluent Python] #python
@nlpjoe
nlpjoe / main.md
Created July 8, 2020 07:20
[SQL: how to count how many times each value repeats in a column, and sort the results by count in descending order?] #sql
select column_to_count, count(*) from table
group by column_to_count
order by count(*) desc
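The query can be sanity-checked with Python's built-in sqlite3 module; the `tags` table and its values here are hypothetical, just to illustrate the pattern:

```python
import sqlite3

# In-memory database with a hypothetical one-column table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT)")
conn.executemany("INSERT INTO tags VALUES (?)",
                 [("nlp",), ("ml",), ("nlp",), ("nlp",), ("ml",), ("sql",)])

# Count occurrences of each value and sort by count, descending
rows = conn.execute(
    "SELECT name, COUNT(*) FROM tags GROUP BY name ORDER BY COUNT(*) DESC"
).fetchall()
print(rows)  # [('nlp', 3), ('ml', 2), ('sql', 1)]
```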
@nlpjoe
nlpjoe / main.md
Last active July 12, 2020 17:00
[finetune XLM-Roberta with MLM] #nlp
@nlpjoe
nlpjoe / main.md
Last active June 21, 2020 13:13
[Sharing session] #ml

[1] June 23: competition ends

[2] Pretrained model framework

[3] Data augmentation, class balancing

[4] Tool sharing - GitHub, Gist

[option] Base models

@nlpjoe
nlpjoe / main.md
Created June 21, 2020 12:13
[Sharing session] #ml

Data augmentation, class balancing

Pretrained models: transformers

@nlpjoe
nlpjoe / RocAucMeter.md
Last active August 26, 2020 08:28
[AUC, cosine metrics, F1 score] #pytorch #ml

Macro F1: split the n-class evaluation into n binary (one-vs-rest) evaluations, compute an F1 score for each, and take the mean of the n F1 scores; that mean is the Macro F1.

Micro F1: split the n-class evaluation into n binary evaluations, sum the TP, FP, and FN counts across all n of them, compute precision and recall from the pooled counts, and the F1 score derived from that precision and recall is the Micro F1.

In general, a higher Macro F1 or Micro F1 means better classification. Macro F1 is heavily influenced by classes with few samples.

Basic elements:

(1) If an instance is positive and is predicted positive, it is a true positive (TP).

(2) If an instance is positive but is predicted negative, it is a false negative (FN).
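The two definitions above can be checked with a small sketch; `macro_micro_f1` is a hypothetical helper written for illustration, not code from the gist:

```python
from collections import Counter

def macro_micro_f1(y_true, y_pred):
    """Macro/Micro F1 via n one-vs-rest binary evaluations."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    # Macro F1: per-class F1, then unweighted mean
    f1s = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(f1s) / len(labels)
    # Micro F1: pool TP/FP/FN across classes, then compute one F1
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_prec = TP / (TP + FP)
    micro_rec = TP / (TP + FN)
    micro = 2 * micro_prec * micro_rec / (micro_prec + micro_rec)
    return macro, micro
```

Note that for single-label multiclass data, pooled FP and FN are both just the number of misclassifications, so Micro F1 reduces to accuracy.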

@nlpjoe
nlpjoe / BalanceClassSampler.md
Last active June 29, 2020 09:45
[class balance] #pytorch #ml
from typing import List

from torch.utils.data.sampler import Sampler

class BalanceClassSampler(Sampler):
    """Abstraction over data sampler.
    Allows you to create stratified sample on unbalanced classes.
    """

    def __init__(self, labels: List[int], mode: str = "downsampling"):
        """
@nlpjoe
nlpjoe / [Tool collection] #pytorch
Last active September 17, 2020 09:12
[utils] #ml #pytorch
OSS connection tool (IO)
Read ODPS tables
PyTorch train/test Dataset that reads ODPS tables
@nlpjoe
nlpjoe / generate_synthetic.md
Last active August 27, 2020 08:40
[NLP albumentations, data augmentation] #pytorch #ml

Excellent article: https://neptune.ai/blog/data-augmentation-nlp

import pandas as pd

class SynthesicOpenSubtitlesTransform(NLPTransform):
    # NLPTransform, ROOT_PATH and clean_text are defined elsewhere in the gist
    def __init__(self, always_apply=False, p=0.5):
        super(SynthesicOpenSubtitlesTransform, self).__init__(always_apply, p)
        df = pd.read_csv(f'{ROOT_PATH}/input/open-subtitles-toxic-pseudo-labeling/open-subtitles-synthesic.csv', index_col='id')[['comment_text', 'toxic', 'lang']]
        df = df[~df['comment_text'].isna()]  # drop rows with missing text
        # clean_text runs per row; parallel_apply comes from pandarallel
        df['comment_text'] = df.parallel_apply(lambda x: clean_text(x['comment_text'], x['lang']), axis=1)
        df = df.drop_duplicates(subset='comment_text')