@mariushoch
Last active January 21, 2021 19:54

PropertySuggester update tools

Step by step

  • Run ./scheduleUpdateSuggester 20180312 on Toolforge (replace 20180312 with the date of the latest JSON dump)
  • Wait
  • Check the logs in updateSuggester.err for problems during the creation
  • Run sha1sum analyzed-out (or whatever hashing algorithm you prefer) and note the checksum
  • gzip analyzed-out
  • Rsync analyzed-out.gz to your local machine and commit it to the wbs_propertypairs repo.
  • Download it on terbium (or some other maintenance host) with https_proxy=http://webproxy.eqiad.wmnet:8080 wget 'https://github.com/wmde/wbs_propertypairs/raw/master/20180312/wbs_propertypairs.csv.gz' (again, replace 20180312 with the date of the JSON dump you produced); the transfer and import commands are sketched after this list.
  • Unpack it: gzip -d wbs_propertypairs.csv.gz
  • Compare the checksum to the one obtained on Toolforge
  • Update the actual table: mwscript extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file wbs_propertypairs.csv
  • Run T132839-Workarounds.sh (on terbium)
  • Log your changes
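
Roughly, the transfer and import steps above look like this (a sketch only: the Toolforge login host, local paths and the repo layout are assumptions, while the wget/gzip/sha1sum/mwscript invocations are the ones from the list):

# On your local machine (login host and tool directory are assumptions):
rsync login.toolforge.org:/data/project/<your-tool>/analyzed-out.gz .
mv analyzed-out.gz wbs_propertypairs.csv.gz    # name used in the wbs_propertypairs repo
# Commit it to wmde/wbs_propertypairs in a directory named after the dump date, e.g. 20180312/.

# On terbium (or another maintenance host):
https_proxy=http://webproxy.eqiad.wmnet:8080 wget 'https://github.com/wmde/wbs_propertypairs/raw/master/20180312/wbs_propertypairs.csv.gz'
gzip -d wbs_propertypairs.csv.gz
sha1sum wbs_propertypairs.csv    # must match the checksum of analyzed-out noted on Toolforge
mwscript extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file wbs_propertypairs.csv
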
#!/bin/bash
# scheduleUpdateSuggester: submit updateSuggester.sh as a Toolforge grid job.
if [[ -z $1 ]]; then
    echo "First argument needs to be the JSON dump to use, like 20160905"
    exit 1
fi
# Remove old logs
rm -f updateSuggester.err
rm -f updateSuggester.out
jsub -mem 3500m -N updateSuggester "$HOME/updateSuggester.sh" "$1"
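
Scheduling and monitoring then looks roughly like this (qstat and tail are just the usual grid tools; the job and log file names come from the script above):

./scheduleUpdateSuggester 20180312    # date of the JSON dump, as in the step list
qstat                                 # the job shows up as "updateSuggester" while it runs
tail -f updateSuggester.err updateSuggester.out
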
#!/bin/bash
# T132839-Workarounds.sh: delete unwanted suggestion rows from wbs_propertypairs after the import (run on terbium).
echo -n 'Removing ext ids in item context '
i=0
# Delete in batches of 5000 rows (40 passes) to keep the write queries short.
while [ $i -lt 40 ]; do
    echo -n '.'
    sql wikidatawiki --write -- --execute "DELETE FROM wbs_propertypairs WHERE pid1 IN (SELECT pi_property_id FROM wb_property_info WHERE pi_type = 'external-id') AND context = 'item' LIMIT 5000"
    let i++
    sleep 3
done
# Properties whose item-context suggestions get removed entirely.
pids=(17 18 276 301 373 463 495 571 641 1344 1448 1476)
for pid in "${pids[@]}"; do
    echo
    echo "Removing P$pid item context"
    sql wikidatawiki --write -- --execute "DELETE FROM wbs_propertypairs WHERE pid1 = '$pid' AND context = 'item' LIMIT 5000"
done
echo
echo "Removing P31 qualifier suggestions for P569, P570, P571, P576"
sql wikidatawiki --write -- --execute "DELETE FROM wbs_propertypairs WHERE context = 'qualifier' AND pid1 IN(569, 570, 571, 576) AND pid2 = 31 LIMIT 5000"
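
To sanity-check the first deletion loop, a read-only count along these lines can help (a hypothetical check, not part of the original workaround; the loop above only runs 40 batches of 5000 rows, so a non-zero count means more passes are needed):

sql wikidatawiki -- --execute "SELECT COUNT(*) FROM wbs_propertypairs WHERE context = 'item' AND pid1 IN (SELECT pi_property_id FROM wb_property_info WHERE pi_type = 'external-id')"
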
#!/bin/bash
# updateSuggester.sh: convert and analyze the Wikidata JSON dump (run as a grid job via scheduleUpdateSuggester).
if [[ -z $1 ]]; then
    echo "First argument needs to be the JSON dump to use, like 20160905"
    exit 1
fi
set -ex
# Prefer the dump from the shared dumps mount, fall back to a local copy, else download it.
DUMP=/public/dumps/public/wikidatawiki/entities/$1/wikidata-$1-all.json.gz
if [ ! -s "$DUMP" ]; then
    DUMP=$HOME/PropertySuggester-wikidata-$1-all.json.gz
fi
if [ ! -s "$DUMP" ]; then
    echo "$DUMP not found, manually downloading."
    echo
    curl "https://dumps.wikimedia.org/wikidatawiki/entities/$1/wikidata-$1-all.json.gz" > "$DUMP"
fi
cd $HOME/wikibase-property-suggester-scripts
# Activate virtualenv
. bin/activate
export LC_ALL=en_US.UTF-8
# XXX: Could also use /tmp here instead of $HOME to take load off NFS, but then again /tmp might be too small
# XXX: What about /mnt/nfs/labstore1003-scratch?
PYTHONPATH=build/lib/ python3 ./build/lib/scripts/dumpconverter.py "$DUMP" > "$HOME/dumpconvert.csv"
PYTHONPATH=build/lib/ python3 ./build/lib/scripts/analyzer.py "$HOME/dumpconvert.csv" "$HOME/analyzed-out"
rm "$HOME/dumpconvert.csv"
rm -f "$HOME/PropertySuggester-wikidata-$1-all.json.gz"
@Ladsgroup

BTW, running this on the stat machines at any time publishes the result:

# Runs on a stat machine; /srv/published is synced to analytics.wikimedia.org/published/.
DUMP=/mnt/data/xmldatadumps/public/wikidatawiki/entities/latest-all.json.gz
# Derive the dump date from the target of the latest-all.json.gz symlink.
DATE=$(readlink -f $DUMP | grep -Eo '20[0-9]+' | head -n 1)
echo Analyzing dump of $DATE
# Start from a fresh checkout of the analysis scripts.
if [ -d "wikibase-property-suggester-scripts" ]; then
  rm -rf wikibase-property-suggester-scripts
fi

git clone "https://gerrit.wikimedia.org/r/wikibase/property-suggester-scripts" wikibase-property-suggester-scripts
cd wikibase-property-suggester-scripts
# The stat machines only reach the outside world through the webproxy.
https_proxy=http://webproxy.eqiad.wmnet:8080 virtualenv -p python3 venv
source venv/bin/activate
https_proxy=http://webproxy.eqiad.wmnet:8080 pip install .
export LC_ALL=en_US.UTF-8
python scripts/dumpconverter.py $DUMP > dumpconvert.csv
python scripts/analyzer.py dumpconvert.csv analyzed-out
# Record the checksum of the uncompressed result so it can be verified after download.
export CHECKSUM=$(sha1sum analyzed-out | awk '{print $1}')
echo Checksum: $CHECKSUM
rm dumpconvert.csv
gzip analyzed-out
mkdir -p "/srv/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/$DATE"
mv analyzed-out.gz "/srv/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/$DATE/analyzed-out.gz"
echo $CHECKSUM > "/srv/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/$DATE/checksum"

Like: https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/
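
For running the import in production from these published files, the fetch on a maintenance host could look roughly like this (a sketch; whether the webproxy is needed for analytics.wikimedia.org is an assumption, and the UpdateTable.php call is the same as in the step list above):

DATE=20210118    # the dump date directory to use
BASE=https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs
https_proxy=http://webproxy.eqiad.wmnet:8080 wget "$BASE/$DATE/analyzed-out.gz" "$BASE/$DATE/checksum"
gzip -d analyzed-out.gz
sha1sum analyzed-out && cat checksum    # the two hashes must match
mwscript extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file analyzed-out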

I was thinking of putting it in a cronjob, what do you think?

@mariushoch (Author)

@Ladsgroup That sounds like a good idea (monthly?).

@Ladsgroup

Okay. I put it on stat1005; it runs on the fifth of every month, and over time we can just download the result and run the import in production instead (I started a run right now to make sure it works fine).
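
A crontab entry matching that schedule might look like this (hypothetical: the script path, time of day and log file are made up, only the host and the monthly schedule come from the comment above):

# run the wbs_propertypairs analysis on the fifth of every month on stat1005
0 2 5 * * /home/ladsgroup/update_wbs_propertypairs.sh >> /home/ladsgroup/update_wbs_propertypairs.log 2>&1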

@mariushoch (Author)

So it works: https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/20210118/

I will try to do it next month ^^

Thanks, I'll also add a note for next month to check! :)
