How to run sparse retrieval on Japanese texts with Pyserini


VM Environments

  • Java 11.0.13
  • Maven 3.8.3
  • Lucene 8.10.1
  • Python 3.9.2

Get a VM (a Docker container) with JDK 11 and Maven

$ docker pull maven:3.8.3-openjdk-11
$ docker run -it maven:3.8.3-openjdk-11 bash
root@277eb5500d97:/# java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment 18.9 (build 11.0.13+8)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8, mixed mode, sharing)
root@277eb5500d97:/# mvn -version
Apache Maven 3.8.3 (ff8e977a158738155dc465c6a97ffaf31982d739)
Maven home: /usr/share/maven
Java version: 11.0.13, vendor: Oracle Corporation, runtime: /usr/local/openjdk-11
Default locale: en, platform encoding: UTF-8
OS name: "linux", version: "4.19.104-microsoft-standard", arch: "amd64", family: "unix"

Install Lucene and Anserini in VM

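# Download and unpack the Lucene 8.10.1 binary release.
# If this mirror no longer carries 8.10.1, older releases are usually kept under archive.apache.org/dist/lucene/java/.
wget "https://dlcdn.apache.org/lucene/java/8.10.1/lucene-8.10.1.tgz"
tar xvfz lucene-8.10.1.tgz
# Point LUCENE_HOME at the unpacked directory (/lucene-8.10.1 when run from /, as in the container above)
export LUCENE_HOME="$PWD/lucene-8.10.1"
# Add the core, query parser, analyzer, and demo jars to the classpath
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/core/lucene-core-8.10.1.jar
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/queryparser/lucene-queryparser-8.10.1.jar
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/analysis/common/lucene-analyzers-common-8.10.1.jar
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/demo/lucene-demo-8.10.1.jar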

git clone --recurse-submodules https://github.com/castorini/anserini.git
cd anserini && mvn clean package appassembler:assemble -DskipTests -Dmaven.javadoc.skip=true

Install pyserini and other packages

apt update
apt install -y python3 python3-pip
python3 -m pip install -U pip
python3 -m pip install pyserini faiss-cpu torch

Sample texts

  • ja_texts/text.jsonl
{"id": "doc1", "contents": "吾輩わがはいは猫である。名前はまだ無い。"}
{"id": "doc2", "contents": "どこで生れたかとんと見当けんとうがつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。"}
{"id": "doc3", "contents": "吾輩はここで始めて人間というものを見た。しかもあとで聞くとそれは書生という人間中で一番獰悪どうあくな種族であったそうだ。"}
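The sample file can be written by hand or generated with a short script. The following is a minimal sketch, assuming the three documents above and the ja_texts/text.jsonl path used in this gist; the helper name `write_jsonl` is made up for illustration. Passing `ensure_ascii=False` keeps the Japanese text readable in the file instead of turning it into `\uXXXX` escapes.

```python
import json
from pathlib import Path

# The three sample documents shown above, copied verbatim.
DOCS = [
    {"id": "doc1", "contents": "吾輩わがはいは猫である。名前はまだ無い。"},
    {"id": "doc2", "contents": "どこで生れたかとんと見当けんとうがつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。"},
    {"id": "doc3", "contents": "吾輩はここで始めて人間というものを見た。しかもあとで聞くとそれは書生という人間中で一番獰悪どうあくな種族であったそうだ。"},
]


def write_jsonl(path, docs):
    """Write one JSON object per line as UTF-8, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for doc in docs:
            # ensure_ascii=False keeps Japanese characters unescaped
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_jsonl("ja_texts/text.jsonl", DOCS)
```

Running the script from the working directory creates ja_texts/ if it does not exist yet.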

Index texts

mkdir -p indexes/ja_texts
python3 -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 1 -language ja -input ja_texts -index indexes/ja_texts -storePositions -storeDocvectors -storeRaw
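JsonCollection reads the `id` and `contents` fields of each JSON object, so a malformed line is easier to catch before indexing than after. Below is a small sketch of such a check. `check_jsonl` is a hypothetical helper written for this note, not part of Pyserini, and it only tests that each line parses and carries both fields.

```python
import json


def check_jsonl(lines):
    """Return (line_number, problem) tuples for lines that look unusable."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            doc = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append((n, f"invalid JSON: {e.msg}"))
            continue
        if not isinstance(doc, dict):
            problems.append((n, "not a JSON object"))
            continue
        for field in ("id", "contents"):
            if field not in doc:
                problems.append((n, f"missing field: {field}"))
    return problems


# In-memory example: the second line lacks "contents".
sample = [
    '{"id": "doc1", "contents": "吾輩わがはいは猫である。名前はまだ無い。"}',
    '{"id": "doc2"}',
]
print(check_jsonl(sample))  # [(2, 'missing field: contents')]
```

To check the real file, pass an open file handle, for example `check_jsonl(open("ja_texts/text.jsonl", encoding="utf-8"))`; an empty list means no problems were found.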

Search texts

  • SimpleSearcher.py
from pyserini.search import SimpleSearcher

q = '吾輩'
searcher = SimpleSearcher('indexes/ja_texts')
searcher.set_language('ja')
hits = searcher.search(q)

# Print rank, document ID, and score for each hit
for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2} {hit.docid:4} {hit.score:.5f}')
  • Results
python3 SimpleSearcher.py
 1 doc1 0.27330
 2 doc3 0.23620

Fetch texts

  • SimpleFetcher.py
from pyserini.search import SimpleSearcher
import json

docid = 'doc1'
searcher = SimpleSearcher('indexes/ja_texts')

json_doc = json.loads(searcher.doc(docid).raw())
print(json_doc['contents'])
  • Result
python3 SimpleFetcher.py
吾輩わがはいは猫である。名前はまだ無い。
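
The search and fetch steps above can be combined to print each hit together with its stored text. This is a sketch, not a tested recipe: it assumes the index was built with -storeRaw as in the indexing step, and the function names `contents_from_raw` and `search_with_contents` are made up for illustration. It uses only the `SimpleSearcher` calls already shown in this gist (`set_language`, `search`, `doc`, `raw`).

```python
import json


def contents_from_raw(raw):
    """Extract "contents" from a raw JSON document; None if no document."""
    if raw is None:
        return None
    return json.loads(raw).get("contents")


def search_with_contents(query, index_dir="indexes/ja_texts", k=10):
    """Yield (rank, docid, score, contents) for the top-k hits."""
    # Imported lazily so contents_from_raw works without Pyserini installed.
    from pyserini.search import SimpleSearcher

    searcher = SimpleSearcher(index_dir)
    searcher.set_language("ja")
    for rank, hit in enumerate(searcher.search(query, k=k), start=1):
        doc = searcher.doc(hit.docid)  # None if the docid is not in the index
        raw = doc.raw() if doc is not None else None
        yield rank, hit.docid, hit.score, contents_from_raw(raw)
```

A loop such as `for rank, docid, score, contents in search_with_contents('吾輩'): print(rank, docid, score, contents)` would then list each matching document alongside its text.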

