# PropertySuggester update tools

## Step by step

- Run `./scheduleUpdateSuggester 20180312` on Toolforge (replace `20180312` with the date of the latest JSON dump).
- Wait.
- Check the logs in `updateSuggester.err` for problems during the creation.
- Run `sha1sum analyzed-out` (or whatever hashing algorithm you prefer).
- `gzip analyzed-out`
- Rsync `analyzed-out.gz` to your local machine and commit it to the wbs_propertypairs repo (see the sketch after this list).
- Download it to terbium (or some other maintenance host) with `https_proxy=http://webproxy.eqiad.wmnet:8080 wget 'https://github.com/wmde/wbs_propertypairs/raw/master/20180312/wbs_propertypairs.csv.gz'` (again, replace `20180312` with the date of the JSON dump you produced).
- Unpack it: `gzip -d wbs_propertypairs.csv.gz`
- Compare the checksum to the one obtained on Toolforge.
- Update the actual table: `mwscript extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file wbs_propertypairs.csv`
- Run `T132839-Workarounds.sh` (on terbium).
- Log your changes.
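
The rsync-and-commit step is not spelled out above; here is a minimal sketch of it, assuming `analyzed-out.gz` sits in the tool's home directory on Toolforge and that your local clone of wbs_propertypairs lives at `~/wbs_propertypairs` (the host alias and paths are assumptions, the `DATE/wbs_propertypairs.csv.gz` layout follows from the wget URL above):

```bash
# On your local machine; host and paths are assumptions, adjust to your setup.
DATE=20180312
cd ~/wbs_propertypairs        # local clone of https://github.com/wmde/wbs_propertypairs
mkdir -p "$DATE"
rsync login.toolforge.org:analyzed-out.gz "$DATE/wbs_propertypairs.csv.gz"
git add "$DATE/wbs_propertypairs.csv.gz"
git commit -m "Add wbs_propertypairs for the $DATE dump"
git push
```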
**scheduleUpdateSuggester**

```bash
#!/bin/bash
if [[ -z $1 ]]; then
    echo "First argument needs to be the JSON dump date, like 20160905"
    exit 1
fi
# Remove old logs from previous runs
rm -f updateSuggester.err
rm -f updateSuggester.out
# Schedule the actual update job on the Toolforge grid with 3.5 GB of memory
jsub -mem 3500m -N updateSuggester "$HOME/updateSuggester.sh" "$1"
```
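
`jsub` submits the update as a grid engine job, so after scheduling you can keep an eye on it with the usual grid tools. A sketch (the job name matches `-N updateSuggester` above):

```bash
./scheduleUpdateSuggester 20180312
qstat                          # the updateSuggester job should show up as queued or running
tail -f updateSuggester.err    # follow the error log while the job runs
```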
**T132839-Workarounds.sh**

```bash
#!/bin/bash
# Remove suggestions derived from external-id properties in item context, in batches of 5000
echo -n 'Removing ext ids in item context '
i=0
while [ $i -lt 40 ]; do
    echo -n '.'
    sql wikidatawiki --write -- --execute "DELETE FROM wbs_propertypairs WHERE pid1 IN (SELECT pi_property_id FROM wb_property_info WHERE pi_type = 'external-id') AND context = 'item' LIMIT 5000"
    let i++
    sleep 3
done

# Further properties whose item-context suggestions are removed
pids=(17 18 276 301 373 463 495 571 641 1344 1448 1476)
for pid in "${pids[@]}"; do
    echo
    echo "Removing P$pid item context"
    sql wikidatawiki --write -- --execute "DELETE FROM wbs_propertypairs WHERE pid1 = '$pid' AND context = 'item' LIMIT 5000"
done

echo
echo "Removing P31 qualifier suggestions for P569, P570, P571, P576"
sql wikidatawiki --write -- --execute "DELETE FROM wbs_propertypairs WHERE context = 'qualifier' AND pid1 IN(569, 570, 571, 576) AND pid2 = 31 LIMIT 5000"
```
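
To verify the workarounds removed what they target, a read-only check along these lines can be run afterwards (a sketch using the same `sql` wrapper as the script; non-zero counts mean more delete batches are needed, since each statement only removes up to 5000 rows at a time):

```bash
# Sanity checks after running T132839-Workarounds.sh (not part of the original script)
sql wikidatawiki -- --execute "SELECT COUNT(*) FROM wbs_propertypairs WHERE context = 'item' AND pid1 IN (SELECT pi_property_id FROM wb_property_info WHERE pi_type = 'external-id')"
sql wikidatawiki -- --execute "SELECT COUNT(*) FROM wbs_propertypairs WHERE context = 'qualifier' AND pid1 IN (569, 570, 571, 576) AND pid2 = 31"
```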
**updateSuggester.sh**

```bash
#!/bin/bash
if [[ -z $1 ]]; then
    echo "First argument needs to be the JSON dump date, like 20160905"
    exit 1
fi
set -ex

# Use the dump from the public dumps NFS share if present, otherwise fall back to a local copy
DUMP=/public/dumps/public/wikidatawiki/entities/$1/wikidata-$1-all.json.gz
if [ ! -s "$DUMP" ]; then
    DUMP=$HOME/PropertySuggester-wikidata-$1-all.json.gz
fi
if [ ! -s "$DUMP" ]; then
    echo "$DUMP not found, manually downloading."
    echo
    curl "https://dumps.wikimedia.org/wikidatawiki/entities/$1/wikidata-$1-all.json.gz" > "$DUMP"
fi

cd $HOME/wikibase-property-suggester-scripts
# Activate the virtualenv
. bin/activate
export LC_ALL=en_US.UTF-8
# XXX: Could also use /tmp here instead of $HOME to take load off NFS, but then again /tmp might be too small
# XXX: What about /mnt/nfs/labstore1003-scratch?
PYTHONPATH=build/lib/ python3 ./build/lib/scripts/dumpconverter.py "$DUMP" > $HOME/dumpconvert.csv
PYTHONPATH=build/lib/ python3 ./build/lib/scripts/analyzer.py $HOME/dumpconvert.csv $HOME/analyzed-out
rm $HOME/dumpconvert.csv
rm -f $HOME/PropertySuggester-wikidata-$1-all.json.gz
```
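
For debugging it can help to run this script in the foreground instead of through `jsub`. A sketch, writing to the same log files the scheduler wrapper expects:

```bash
# Foreground run, e.g. inside a screen/tmux session (hypothetical invocation)
./updateSuggester.sh 20180312 > updateSuggester.out 2> updateSuggester.err
sha1sum "$HOME/analyzed-out"   # checksum to compare against later on the maintenance host
```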
@Ladsgroup commented Apr 25, 2020

BTW, running this on the stats machines at any time publishes the result to the datasets directory:

```bash
DUMP=/mnt/data/xmldatadumps/public/wikidatawiki/entities/latest-all.json.gz
DATE=$(readlink -f $DUMP | grep -Eo '20[0-9]+' | head -n 1)
echo Analyzing dump of $DATE
if [ -d "wikibase-property-suggester-scripts" ]; then
  rm -rf wikibase-property-suggester-scripts
fi

git clone "https://gerrit.wikimedia.org/r/wikibase/property-suggester-scripts" wikibase-property-suggester-scripts
cd wikibase-property-suggester-scripts
https_proxy=http://webproxy.eqiad.wmnet:8080 virtualenv -p python3 venv
source venv/bin/activate
https_proxy=http://webproxy.eqiad.wmnet:8080 pip install .
export LC_ALL=en_US.UTF-8
python scripts/dumpconverter.py $DUMP > dumpconvert.csv
python scripts/analyzer.py dumpconvert.csv analyzed-out
export CHECKSUM=$(sha1sum analyzed-out | awk '{print $1}')
echo Checksum: $CHECKSUM
rm dumpconvert.csv
gzip analyzed-out
mkdir -p "/srv/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/$DATE"
mv analyzed-out.gz "/srv/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/$DATE/analyzed-out.gz"
echo $CHECKSUM > "/srv/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/$DATE/checksum"
```

Like: https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/

I was thinking of putting it in a cron job. What do you think?
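
If it does end up in a cron job, a minimal crontab sketch could look like this (the script path and the weekly schedule are assumptions; adjust to the dump cadence):

```bash
# Crontab entry (sketch): run the snippet above, saved as an executable script, once a week
0 3 * * 5 /home/ladsgroup/update-wbs-propertypairs.sh >> /home/ladsgroup/update-wbs-propertypairs.log 2>&1
```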
