Ettore Rizza ettorerizza

🏠
Working from home
View GitHub Profile
@ettorerizza
ettorerizza / Search_Wikipedia.py
Last active June 24, 2016 14:01
# This script takes a list of names and checks whether each one exists first in Wikipedia.fr, then in Wikipedia.nl
# -*- coding: utf-8 -*-
######################################################
#
# This script takes a list of names and checks
# whether each one exists first in Wikipedia.fr,
# then in Wikipedia.nl
#
######################################################
ettorerizza / create_column_openrefine.py
Last active March 7, 2017 11:21
This script takes an OpenRefine JSON operations file as input and returns the same file, in which each "transform" and each "mass edit" is documented in a column
#!/usr/bin/python3
import json
with open("test.json", "r") as infile:
    data = json.load(infile)
def transform_to_addcolumn(data):
    data_trans = dict(data)
    data_trans["op"] = "core/column-addition"
    data_trans["expression"] = (
ettorerizza / refinetranslator.py
Last active April 28, 2018 00:04
A mini Python 3 script that turns a list of operations performed in OpenRefine into a text file that is easier to read. To use it, paste your OpenRefine "undo/redo" history into a file named, for example, "operations.json", place this file in the same folder as the Python script, and run this command: python refinetranslator.py operations.json
#!/usr/bin/python3
import json
import sys
with open(sys.argv[1], "r") as infile:
    data = json.load(infile)
outfile = open(sys.argv[1]+".txt", 'w')
count = 1
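The preview above is cut off after the counter is initialised, so the rest of the script is not shown. As a rough sketch of the idea described (numbering each operation and writing a readable line for it), the following assumes that every entry in the exported history is a JSON object carrying a "description" key and an "op" key. The function name `describe_operations` and the sample data are hypothetical and are not taken from the gist.

```python
import json


def describe_operations(operations):
    """Return one numbered, human-readable line per operation.

    Assumption: each operation is a dict that usually has a
    "description" key; when it is missing, the raw "op" identifier
    is used instead.
    """
    lines = []
    for count, operation in enumerate(operations, start=1):
        text = operation.get("description") or operation.get("op", "unknown operation")
        lines.append(f"{count}. {text}")
    return lines


# Hypothetical sample standing in for a pasted "undo/redo" history.
sample = json.loads(
    '[{"op": "core/text-transform",'
    ' "description": "Text transform on cells in column name using expression value.trim()"},'
    ' {"op": "core/column-removal"}]'
)

for line in describe_operations(sample):
    print(line)
```

Keeping the formatting in a function that returns a list, rather than writing to the file directly, makes the output easy to check before anything is saved to disk.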
ettorerizza / merge_and_reshape_topics_matrice.R
Last active May 21, 2017 21:27
Takes the matrices produced by several topic modelling runs and reshapes them
library(dplyr)
library(data.table)
library(stringr)
#folder containing the files
setwd("C:/Users/ettor/Desktop/Eurovoc Topicmodeling/presidencies")
#merge the three files
files <- list.files(path = getwd(),
                    pattern = "\\.txt$")
ettorerizza / Open Refine fingerprint function in R
Last active May 28, 2017 11:57
Given a character vector as input, get the key collision fingerprint for each element. Forked from refinr package.
#' Get key collision fingerprints
#'
#' Given a character vector as input, get the key collision fingerprint for
#' each element.
#'
#' Operations in order :
#'
#'-remove leading and trailing whitespace
#'-change all characters to their lowercase representation
#'-remove all punctuation and control characters
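The list of operations above is cut off after its third item, so only those three steps are illustrated here. This is a minimal Python sketch, not the R code from the gist or from the refinr package: the function name `fingerprint_first_steps` is hypothetical, and "punctuation" is approximated as ASCII punctuation plus Unicode control characters (category `Cc`), which may differ from the character classes the original uses.

```python
import string
import unicodedata


def fingerprint_first_steps(values):
    """Apply the three steps listed above to each string in `values`."""
    cleaned = []
    for value in values:
        # 1. Remove leading and trailing whitespace.
        text = value.strip()
        # 2. Change all characters to their lowercase representation.
        text = text.lower()
        # 3. Remove punctuation and control characters (approximation:
        #    ASCII punctuation and Unicode category "Cc").
        text = "".join(
            ch
            for ch in text
            if ch not in string.punctuation and unicodedata.category(ch) != "Cc"
        )
        cleaned.append(text)
    return cleaned


print(fingerprint_first_steps(["  Tom's Café, Inc. ", "TOMS CAFÉ INC"]))
```

After these steps both sample strings become identical, which is the property key collision clustering relies on: variants of the same name collapse to one key.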
ettorerizza / parse_jrc-acquis
Created May 30, 2017 16:11
R script to parse the 26,000 XML/TEI files of the European JRC-Acquis corpus and add their EuroVoc descriptors to them
library(XML)
library(dplyr)
library(stringr)
library(readr)
library(readxl)
library(tidyr)
#list of the XML files of the JRC-Acquis corpus, English version (http://optima.jrc.it/Acquis/JRC-Acquis.3.0/corpus/jrc-en.tgz)
liste <-
list.files(
ettorerizza / levenshtein.py
Last active October 31, 2017 11:16
A function for calculating the Levenshtein edit distance between columns with Jython in OpenRefine
def call_counter(func):
    def helper(*args, **kwargs):
        helper.calls += 1
        return func(*args, **kwargs)
    helper.calls = 0
    helper.__name__ = func.__name__
    return helper
memo = {}
@call_counter
def levenshtein(s, t):
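The preview stops at the function signature, so the body of `levenshtein` is not shown. The `memo` dictionary suggests a memoised recursive version; the sketch below deliberately uses a different technique, an iterative two-row dynamic programming table, because it avoids deep recursion on long cell values. It is written for Python 3, not Jython, and the name `levenshtein_distance` is hypothetical.

```python
def levenshtein_distance(s, t):
    """Return the Levenshtein edit distance between strings s and t.

    Iterative dynamic programming that keeps only two rows of the
    distance table in memory.
    """
    # Make t the shorter string so the rows stay as small as possible.
    if len(s) < len(t):
        s, t = t, s

    # Distance from the empty prefix of s to every prefix of t.
    previous = list(range(len(t) + 1))

    for i, char_s in enumerate(s, start=1):
        # Distance from s[:i] to the empty prefix of t is i deletions.
        current = [i]
        for j, char_t in enumerate(t, start=1):
            substitution_cost = 0 if char_s == char_t else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current

    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))
```

In an OpenRefine Jython expression, a function like this would typically be called with `value` and the value of another column as its two arguments.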
ettorerizza / postag_refine.py
Created July 2, 2017 17:04
OpenRefine/Jython POS tagging with parsetree
import sys
sys.path.append(r'D:\jython2.7.0\Lib\site-packages')
from pattern.fr import parsetree
sentences = parsetree(value, relations=True, lemmata=True)
liste = []
for s in sentences:
    for chunk in s.chunks:
        for w in chunk.words:
ettorerizza / sparql_refine.py
Created July 2, 2017 17:05
OpenRefine/Jython SPARQL query (find possible locations and persons in tokens)
import sys
sys.path.append(r'D:\jython2.7.0\Lib\site-packages')
from SPARQLWrapper import SPARQLWrapper, JSON
from langdetect import detect
dbpedia_version = "http://dbpedia.org/sparql"
#TEST
value = "comptoir"
ettorerizza / extract_names.py
Last active July 9, 2017 08:53
Naive Jython method to detect potential person names in OpenRefine, based on a list of first names
from unidecode import unidecode
with open(r"C:\Users\Boulot\Desktop\prenoms.txt", 'r') as f:
    prenoms = [name.strip().lower() for name in f]
CHARS = "abcdefghijklmnopqrstuvwxyzéèàçüûùABCDEFGHIJKLMNOPQRSTUVWXYZ- "
family_joint = ["d'", "de", "du", "der", "den", "vander", "vanden", "van", "le"]
#TEST