@LinguList

LinguList/README.md

Last active Feb 23, 2021
Working with WALS Data in CLDF

How to work with WALS data in CLDF

This code example accompanies a blog post published on the blog "Computer-Assisted Language Comparison in Practice" (https://calc.hypotheses.org).

To get started, install the WALS dataset in CLDF format with pip, ideally in a fresh virtual environment.

$ pip install -e git+https://github.com/cldf-datasets/wals.git@v2020#egg=cldfbench_wals
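
To confirm that the installation landed in the active environment, a quick check along the following lines may help. This is only a sketch: the helper `is_installed` is hypothetical, and the module name `cldfbench_wals` is an assumption taken from the `#egg=` part of the install command above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


# "cldfbench_wals" is assumed from the egg name in the pip command.
for name in ("cldfbench", "cldfbench_wals"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either module is reported as missing, the virtual environment used for the install is probably not the one that is currently active.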

Once this is done, you can run the script wals.py by typing:

$ python wals.py

For details, check https://calc.hypotheses.org/2670.
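
As a rough picture of what the script does, the sketch below pivots a "long" list of values (one row per language and parameter) into a "wide" table (one row per language). It runs without the WALS data: all IDs, names and values are invented stand-ins for the CLDF tables, and the function `pivot` is a hypothetical helper, not part of cldfbench.

```python
# Toy stand-ins for the CLDF tables; every ID and name here is invented.
parameters = {"P1": {"Name": "Feature one"}, "P2": {"Name": "Feature two"}}
codes = {
    "P1-1": {"Name": "yes"},
    "P1-2": {"Name": "no"},
    "P2-1": {"Name": "initial"},
}
values = [
    {"Language_ID": "lx", "Parameter_ID": "P1", "Code_ID": "P1-1"},
    {"Language_ID": "lx", "Parameter_ID": "P2", "Code_ID": "P2-1"},
    {"Language_ID": "ly", "Parameter_ID": "P1", "Code_ID": "P1-2"},
]


def pivot(values, parameters, codes):
    """Turn long-format value rows into one list of cells per language."""
    # Column position of each parameter, in the order of `parameters`.
    column = {pid: i for i, pid in enumerate(parameters)}
    table = {}
    for row in values:
        # Start each language with one empty cell per parameter.
        cells = table.setdefault(row["Language_ID"], [""] * len(parameters))
        cells[column[row["Parameter_ID"]]] = codes[row["Code_ID"]]["Name"]
    return table


wide = pivot(values, parameters, codes)
for language_id, cells in wide.items():
    print(language_id, cells)
```

Unlike the script, this sketch only creates rows for languages that have at least one value; the script starts from the language table, so every language gets a row.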

"""
Load WALS data and convert them to a table.
"""
from cldfbench import get_dataset
from collections import OrderedDict
import codecs

# Load the installed WALS dataset and read its CLDF tables into dictionaries.
wals = get_dataset("wals").cldf_reader()
languages = {row["ID"]: row for row in wals.iter_rows("LanguageTable")}
parameters = OrderedDict({row["ID"]: row for row in wals.iter_rows("ParameterTable")})
codes = OrderedDict({row["ID"]: row for row in wals.iter_rows("CodeTable")})

# Map each parameter ID to its column position (avoids a list search per value).
parameter_index = {pid: i for i, pid in enumerate(parameters)}

# One row per language with one, initially empty, cell per parameter.
varieties = {
    language["ID"]: ["" for x in parameters] for language in languages.values()
}

# Fill each cell with the name of the code attested for that language.
for row in wals.iter_rows("ValueTable"):
    pid = parameter_index[row["Parameter_ID"]]
    varieties[row["Language_ID"]][pid] = codes[row["Code_ID"]]["Name"]

# Quick check: print up to 20 attested features of the language with ID "aab".
count = 0
for i, param in enumerate(parameters):
    if count == 20:
        break
    if varieties["aab"][i]:
        print(param, varieties["aab"][i])
        count += 1

# Write the table: six metadata columns, then one column per parameter.
with codecs.open("wals_by_language.tsv", "w", "utf-8") as f:
    f.write(
        "\t".join(
            [
                "ID",
                "Name",
                "Glottocode",
                "Family",
                "Latitude",
                "Longitude",
            ]
        )
        + "\t"
        + "\t".join([row["Name"] for row in parameters.values()])
        + "\n"
    )
    for variety, values in varieties.items():
        f.write(
            "\t".join(
                [
                    variety,
                    # Missing metadata is written as an empty string.
                    languages[variety]["Name"] or "",
                    languages[variety]["Glottocode"] or "",
                    languages[variety]["Family"] or "",
                    str(languages[variety]["Latitude"] or ""),
                    str(languages[variety]["Longitude"] or ""),
                ]
            )
            + "\t"
            + "\t".join(values)
            + "\n"
        )
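
The resulting table can be read back with Python's standard csv module. The sketch below counts how many features are filled in per language. To stay runnable without the WALS data, it reads a small invented sample in the same column layout instead of wals_by_language.tsv; the function `coverage` and the sample rows are illustrative assumptions, not part of the original script.

```python
import csv
import io

# Invented sample in the layout the script writes: six metadata columns
# followed by one column per parameter. None of these rows are real WALS data.
sample = (
    "ID\tName\tGlottocode\tFamily\tLatitude\tLongitude\tFeature one\tFeature two\n"
    "lx\tLanguage X\t\tFamily A\t1.5\t-2.0\tyes\tinitial\n"
    "ly\tLanguage Y\t\tFamily B\t\t\tno\t\n"
)

META = ["ID", "Name", "Glottocode", "Family", "Latitude", "Longitude"]


def coverage(handle):
    """Count the non-empty feature cells per language in a wide TSV table."""
    # The script writes cells without quoting, so quoting is disabled here too.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    features = [column for column in reader.fieldnames if column not in META]
    return {row["ID"]: sum(1 for c in features if row[c]) for row in reader}


counts = coverage(io.StringIO(sample))
print(counts)
```

To run this on the real output, pass a file handle instead of the sample, for example one opened with `open("wals_by_language.tsv", encoding="utf-8", newline="")`. Note that `csv.DictReader` keys rows by column name, so it would silently merge any parameters that share the same name.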