@kumeS
Created October 10, 2020 19:04
Japanese Morphological Analysis using SudachiPy in R
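# Assumed setup (not part of the original gist): the function below expects
# `tokenizer_obj` and `SplitA` to exist in the calling environment.
# Requires the R package reticulate and the Python packages sudachipy and sudachidict_core.
library(reticulate)
tokenizer_obj <- import("sudachipy.dictionary")$Dictionary()$create()
SplitA <- import("sudachipy.tokenizer")$Tokenizer$SplitMode$A

# Tokenize `text` with SudachiPy and return one row per token:
# the surface form followed by its six part-of-speech tags.
# As with the original 0:N loop, at most N + 1 tokens are returned.
SudachiTokenizerR <- function(text='国家公務員', mode=SplitA, N=100){
  # Tokenize once instead of re-running the tokenizer for every index
  morphemes <- tokenizer_obj$tokenize(text=text, mode=mode)
  # Stop at the real token count so the last token is never repeated
  n_tokens <- min(reticulate::py_len(morphemes), N + 1)
  # Python indices are 0-based
  rows <- lapply(seq_len(n_tokens) - 1L, function(n){
    a <- morphemes[n]
    c(a$surface(), unlist(a$part_of_speech()))
  })
  # Always build a 7-column data frame, even for zero or one token
  m <- matrix(as.character(unlist(rows)), ncol=7, byrow=TRUE)
  sudachipyResults <- data.frame(m, stringsAsFactors=FALSE)
  # Column names: "語彙" = token (surface form), "品詞タグ01".."品詞タグ06" = part-of-speech tags 1 to 6
  colnames(sudachipyResults) <- c("語彙", "品詞タグ01", "品詞タグ02", "品詞タグ03", "品詞タグ04", "品詞タグ05", "品詞タグ06")
  return(sudachipyResults)
}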
# Example usage
# SudachiTokenizerR(text='国家公務員')