Skip to content

Instantly share code, notes, and snippets.

View minhpqn's full-sized avatar

Pham Quang Nhat Minh minhpqn

View GitHub Profile
@minhpqn
minhpqn / useful_pandas_snippets.py
Created February 23, 2016 05:58 — forked from bsweger/useful_pandas_snippets.md
Useful Pandas Snippets
#List unique values in a DataFrame column
pd.unique(df.column_name.ravel())
#Convert Series datatype to numeric, getting rid of any non-numeric values
df['col'] = df['col'].astype(str).convert_objects(convert_numeric=True)
#Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df[df.column.isin(value_list)]
@minhpqn
minhpqn / .screenrc
Created August 11, 2016 01:22 — forked from joaopizani/.screenrc
A killer GNU Screen Config
# the following two lines give a two-line status, with the current window highlighted
hardstatus alwayslastline
hardstatus string '%{= kG}[%{G}%H%? %1`%?%{g}][%= %{= kw}%-w%{+b yk} %n*%t%?(%u)%? %{-}%+w %=%{g}][%{B}%m/%d %{W}%C%A%{g}]'
# huge scrollback buffer
defscrollback 5000
# no welcome message
startup_message off
@minhpqn
minhpqn / no_accent_vietnamese.py
Created February 23, 2017 03:01 — forked from thuandt/no_accent_vietnamese.py
Chuyển đổi từ Tiếng Việt có dấu sang Tiếng Việt không dấu
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Chương trình chuyển đổi từ Tiếng Việt có dấu sang Tiếng Việt không dấu
Chỉnh sửa từ mã nguồn của anh NamNT
http://www.vithon.org/2009/06/14/x%E1%BB%AD-ly-ti%E1%BA%BFng-vi%E1%BB%87t-trong-python
"""
import re
INTAB = "ạảãàáâậầấẩẫăắằặẳẵóòọõỏôộổỗồốơờớợởỡéèẻẹẽêếềệểễúùụủũưựữửừứíìịỉĩýỳỷỵỹđẠẢÃÀÁÂẬẦẤẨẪĂẮẰẶẲẴÓÒỌÕỎÔỘỔỖỒỐƠỜỚỢỞỠÉÈẺẸẼÊẾỀỆỂỄÚÙỤỦŨƯỰỮỬỪỨÍÌỊỈĨÝỲỶỴỸĐ"
@minhpqn
minhpqn / print_cm.py
Created May 10, 2017 11:57 — forked from zachguo/print_cm.py
Pretty print for sklearn confusion matrix
from sklearn.metrics import confusion_matrix
def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
"""pretty print for confusion matrixes"""
columnwidth = max([len(x) for x in labels]+[5]) # 5 is value length
empty_cell = " " * columnwidth
# Print header
print " " + empty_cell,
for label in labels:
print "%{0}s".format(columnwidth) % label,
@minhpqn
minhpqn / mini_sequence_labeler.py
Created January 25, 2018 09:57 — forked from hal3/mini_sequence_labeler.py
PyTorch implementation of a sequence labeler (POS taggger).
"""
PyTorch implementation of a sequence labeler (POS taggger).
Basic architecture:
- take words
- run though bidirectional GRU
- predict labels one word at a time (left to right), using a recurrent neural network "decoder"
The decoder updates hidden state based on:
- most recent word
@minhpqn
minhpqn / log.py
Created January 24, 2019 03:52 — forked from nguyenkims/log.py
Basic example on how setup a Python logger
import logging
import sys
from logging.handlers import TimedRotatingFileHandler
FORMATTER = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
LOG_FILE = "my_app.log"
def get_console_handler():
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setFormatter(FORMATTER)
@minhpqn
minhpqn / rnn_minibatch.py
Created April 12, 2019 22:44 — forked from ritchie46/rnn_minibatch.py
minibatches in pytorch
"""
How to do minibatches for RNNs in pytorch
Assume we feed characters to the model and predict the language of the words.
"""
def prepare_batch(x, y):
# determine the maximum word length per batch and zero pad the tensors
n_max = max([a.shape[0] for a in x])
@minhpqn
minhpqn / download_glue_data.py
Created November 1, 2019 01:33 — forked from W4ngatang/download_glue_data.py
Script for downloading data of the GLUE benchmark (gluebenchmark.com)
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC
@minhpqn
minhpqn / regex-japanese.txt
Created April 13, 2020 15:04 — forked from terrancesnyder/regex-japanese.txt
Regex for Japanese
Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9fcf) ~ The Big Kahuna!
([一-龯])
Regex for matching Hirgana or Katakana
([ぁ-んァ-ン])
Regex for matching Non-Hirgana or Non-Katakana
([^ぁ-んァ-ン])
Regex for matching Hirgana or Katakana or basic punctuation (、。’)
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""This module's docstring summary line.
This is a multi-line docstring. Paragraphs are separated with blank lines.
Lines conform to 79-column limit.
Module and packages names should be short, lower_case_with_underscores.
Notice that this in not PEP8-cheatsheet.py