{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "markdown",
"format_version": "1.2",
"jupytext_version": "1.4.2"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"colab": {
"name": "DOING-MADICS NaLIR Tutorial.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"include_colab_link": true
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/javieraespinosa/38fa82cb5730c0223872e1eb46da2088/doing-madics-nalir-tutorial.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iqoUQdNh6kAy"
},
"source": [
"# Introduction\n",
"\n",
"This notebook contains everything you need for running the [original tutorial](https://github.com/pr3martins/nalir-sbbd/blob/master/NaLIR.ipynb) in [Google Colaboratory](http://colab.research.google.com/). \n",
"\n",
"\n",
"Remarks:\n",
"* The tutorial requires a MySQL instance preconfigured with a database called `mas` (10 GB). For simplicity, we prepared the database in advance to avoid loading and configuring the database from scratch (around 1 hr).\n",
"* The original notebook contained text in Portuguese. The text was translated using [DeepL](http://deepl.com/). \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NBW4mWqarbxM"
},
"source": [
"# Setup\n",
"\n",
"Global variables\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "hyRXRMGoDmDH"
},
"source": [
"DATABASE_HOST = \"localhost\"\n",
"DATABASE_USER = \"root\"\n",
"DATABASE_PASSWORD = \"tutorial\"\n",
"DATABASE_DATABASE = \"mas\"\n",
"\n",
"PATH_JARS = \"/content/jars/new_jars\"\n",
"PATH_ZFILES = \"/content/zfiles\""
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "xRtKIDvBqtlk"
},
"source": [
"### MySQL (~10 mins)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "JcMGgZ2LZ1Ge"
},
"source": [
"print(\"Installing MySQL (~2 min)...\")\n",
"!apt-get install mysql-server > /dev/null 2>&1\n",
"\n",
"print(\"Downloading the MAS database (~2 min)...\")\n",
"!gdown -q 'https://drive.google.com/u/0/uc?id=1iOloq20_N7E7RT0-IygMYdxiT6ajgCFg' > /dev/null 2>&1\n",
"\n",
"print(\"Configuring MySQL with MAS (~3 min)...\")\n",
"!tar -xf mysql_data.tar.gz\n",
"!rm -r /var/lib/mysql\n",
"!mv mysql/ /var/lib/\n",
"\n",
"# fix permissions\n",
"!usermod -d /var/lib/mysql/ mysql\n",
"!chown -R mysql:mysql /var/lib/mysql\n",
"\n",
"# start mysql\n",
"!service mysql start\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "2JC2NPSfrEMN"
},
"source": [
"Test connection to MySQL"
]
},
{
"cell_type": "code",
"metadata": {
"id": "eUa_TrMqZ1T9"
},
"source": [
"!pip -q install PyMySQL\n",
"\n",
"%load_ext sql\n",
"\n",
"%config SqlMagic.feedback=False \n",
"%config SqlMagic.autopandas=True\n",
"%sql mysql+pymysql://$DATABASE_USER:$DATABASE_PASSWORD@/\n",
"\n",
"df = %sql SHOW DATABASES\n",
"df"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "CGtX8bZhkHlM"
},
"source": [
"### NaLIR"
]
},
{
"cell_type": "code",
"metadata": {
"id": "PNLFIIoTL-5j"
},
"source": [
"# Download tutorial files\n",
"!git clone -q https://github.com/pr3martins/nalir-sbbd.git \n",
"\n",
"# Replicate the original notebook configuration\n",
"!mv nalir-sbbd/* .\n",
"!rm -r nalir-sbbd\n",
"\n",
"# Install python dependencies\n",
"!pip install -q -r requirements.txt\n",
"\n",
"# Download & configuration of NaLIR jars\n",
"!gdown -q 'https://drive.google.com/u/0/uc?id=1ggwTbEQsYHb0idMpr0qWvKjk7ulSJQTy' > /dev/null 2>&1\n",
"!unzip -q jars.zip\n",
"!rm jars.zip\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "RvqSUoz4zZzA"
},
"source": [
"### NLTK"
]
},
{
"cell_type": "code",
"metadata": {
"id": "PIcdHwy6zXjP"
},
"source": [
"import nltk\n",
"nltk.download('punkt', quiet=True)\n",
"nltk.download('averaged_perceptron_tagger', quiet=True)\n",
"nltk.download('wordnet', quiet=True)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:25.518921Z",
"start_time": "2020-09-25T23:09:24.833765Z"
},
"id": "3C9B1DJ9L1hV"
},
"source": [
"# NaLIR Tutorial (SBDD 2020)\n",
"\n",
"**Authors:** Altigran da Silva, Brandell Ferreira, Lucas Citolin & Paulo Martins\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ElcMwGRIL1hW"
},
"source": [
"\n",
"In this Notebook we will show the implementation of a Natural Language Interface for a Database. The general architecture of the system will be presented, and for each module, it will be listed what they receive as input, and what they generate as output. In addition, we will show implementation details that are interesting in each module. At the end, we will show some examples of queries for which NaLIR can get a valid SQL. It is important to point out that we only ported the code to Python, the original Java code is in this [repository](https://github.com/umich-dbgroup/NaLIR).\n",
"\n",
"> _Nesse Notebook mostraremos a implementação de uma Interface de Linguage Natural para Banco de Dados. Será apresentada a arquitetura geral do Sistema, e para cada módulo, será listado o que recebem como entrada, e o que geram como saída. Além disso, mostraremos detalhes de implementação que são interessantes em cada módulo. Ao final, mostraremos alguns exemplos de consultas para as quais o NaLIR consegue obter uma SQL válida. É importante ressaltar que apenas portamos o código para Python, o código original em Java se encontra neste [repositório](https://github.com/umich-dbgroup/NaLIR)._"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TH_OKwN6ghFq"
},
"source": [
"## General Architecture"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c49HUe-nL1hW"
},
"source": [
"<a id='arquitetura_geral'></a>\n",
"\n",
"NaLIR is a Natural Language Interface that uses a framework derived from the Dependency Parser to translate natural language queries into a structured database query language (SQL). In addition to using the Dependency Parser, NaLIR also uses user interactions to correct possible errors in the Dependency tree structure and in the chosen default mappings from query words to Database elements.\n",
"\n",
"> _O NaLIR é uma Interface de Lingaugem Natural que utiliza uma estrutura derivada do Parser de Dependência para traduzir as consultas de linguagem natural para uma linguagem estruturada de consulta de banco de dados (SQL). Além de utilizar o Parser de Dependência, o NaLIR também utiliza interações com o usuário para corrigir possíveis erros na estrutura da árvore de Dependência e nos mapeamentos padrões escolhidos das palavras da consulta para elementos de Banco de Dados._\n",
"\n",
"As we can see in the figure below, NaLIR's architecture has **5** components: **Dependecy Parser**, **Query Node Mapper**, **Query Tree Strucutre Adjustor**, **Query Translator** and **Interactive Communicator**. The **Depency Parser** receives the query from the user,and passes the input to the Stanford Dependency parser. After extracting the dependency tree, the **Query Node Mapper** is responsible for mapping the nodes of the dependency tree to elements of a SQL structure.\n",
"\n",
"> _Como podemos ver na figura Abaixo, A arquitetura do NaLIRpossui **5** Componentes: **Dependecy Parser**, **Query Node Mapper**, **Query Tree Strucutre Adjustor**, **Query Translator** e **Interative Comunicator**. O **Depency Parser** recebe a consulta do usuário,e repassa a entrada para o parser de Dependêcia de Stanford. Após retirar gerar a Árvore de Depência, o **Query Node Mapper** é reponsável por mapear os nós da árvore de dependência para elementos da estrutura de uma SQL._\n",
"\n",
"\n",
"\n",
"After the mappings have been chosen, the **Interactive Communicatior** first goes into action, providing the user with the candidates found by the **Query Node Mapper**. After the user chooses the mappings for each Node, the **Query Tree structure Adjustor** runs an algorithm to modify the Tree structure so that the conversion to an SQL query is made easier. Such modifications are done following a grammar. The output of this step is a list of trees, sorted by a score that indicates how correct the tree is and the relationship between the database elements that the nodes reference.\n",
"\n",
"\n",
"> _Após a escolha dos mapementos, o **Interactive Communicatior** entra em ação pela primeira vez, fornecendo para o usuário os candidatos encontrados pelo **Query Node Mapper**. Após o usuário escolher os mapeamentos para cada Nó, **Query Tree structure Adjustor** executa um algoritmo para modificar a estrutura da Árvore para que a conversão para uma consulta SQL seja facilitada. Tais modificações são feitas seguindo uma gramática. A saída dessa etapa é uma lista de árvores, ordenadas por uma pontuação que indica o quão correta está a árvore e a relação entre os elementos do banco de dados que os nós referenciam._\n",
"\n",
"\n",
"\n",
"For the second time, the **Interactive Communicator** kicks in, so that the user chooses which tree to select. After that, the **Query Tree Structure Adjustor** inserts nodes into the tree selected by the user in order to remove possible ellipses from the query. For the last time **Interactive Commnunicator** shows the user the insertions made. After the user's action, the query tree is passed to the **Query Tree Translator** that effectively converts the tree into valid SQL.\n",
"\n",
"> _Pela Segunda vez, o **Interactive Comunicator** entra em ação, para que o usuário escolha qual árvore será escolhida pelo usuario. Após isso, **Query Tree Structure Adjustor** insere nós na árvore selecionada pelo usuário com o intuito de retirar possíveis elipses da consulta. Pela última vez **Interactive Commnunicator** mostra ao usuário as inserções feitas. Após a ação do usuário, a árvore de consulta é repassada para o **Query Tree Translator** que efetivamente converte a árvore para uma SQL válida._\n",
"\n",
"\n",
"In our implementation, **we do not port the Interactive Communicator**, so the only possible flow is one where **NaLIR uses all the standard mappings** to resolve the query.\n",
"\n",
"> _Na nossa implementação, **não portamos o Interactive Communicator**, assim o único fluxo possível é aquele em que o **NaLIR utiliza todos os mapeamentos padrões** para resolver a consulta_.\n",
"\n",
"![Arquitetura Geral](https://github.com/pr3martins/nalir-sbbd/raw/master/imgs/OverallDiagram.png)\n",
"\n",
"\n",
"It is important to point out that although only four modules are present in the architecture of NaLIR, some auxiliary modules were implemented, such as the one that manages access to the Database. Although this module is present between the lines of the flow in the architecture shown above, it was necessary to implement it so that it was possible to build the system.\n",
"\n",
"\n",
"> _É importante salientar que apesar de apenas quatro módulos estarem presentes na arquitetura do NaLIR, alguns módulos auxiliares foram implementados,como o que gerencia o acesso ao Banco de Dados. Apesar deste módulo estar presente nas entrelinhas do fluxo na Arquitetura mostrada acima, foi preciso implementá-lo para que fosse possível a construção do Sistema._"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ffMXcwBBL1hX"
},
"source": [
"## Client Configuration "
]
},
{
"cell_type": "code",
"metadata": {
"id": "4PcNw_pPL1ha"
},
"source": [
"from nalir import *"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "lSkF_EiiL1hc"
},
"source": [
"The first step is to **instantiate** the configuration object responsible for loading the files required for Nalir to run. The configuration is done from a JSON object, whose properties should be:\n",
"\n",
">_O primeiro passo é **instanciar** o objeto de configuração responsável por carregar os arquivos requeridos para que a execução do Nalir possa ser feita. A configuração é feita a partir de um objeto JSON, cujo as propriedades devem ser:_\n",
"\n",
"| Properties | Type | Description |\n",
"| :-- | :-: | :-- | \n",
"| `connection` | Object | Armazena as configurações de conexão com o Banco de dados |\n",
"| `connection.host`| String | Host para a conexão com o Banco de Dados |\n",
"| `connection.password`| String | Senha para conectar com o Banco de Dados |\n",
"| `connection.user` | String | Usuario de Banco de Dados|\n",
"| `connection.database` | String | Nome do Banco de Dados utilizado |\n",
"| `logginMode` | String | Nivel de Login da Aplicação|\n",
"| `zfiles_path`| String | Diretório onde se encontra os arquivos de configuração do NaLIR|\n",
"| `jars_path` | String | Diretório onde se encontra os jars necessários para a execução do NaLIR |"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:25.524674Z",
"start_time": "2020-09-25T23:09:25.520353Z"
},
"id": "ZbhG0FclL1hc"
},
"source": [
"import json\n",
"\n",
"config_json = {\n",
" \"connection\":{\n",
" \"host\": DATABASE_HOST,\n",
" \"password\": DATABASE_PASSWORD,\n",
" \"user\": DATABASE_USER,\n",
" \"database\": DATABASE_DATABASE\n",
" },\n",
" \"loggingMode\": \"ERROR\",\n",
" \"zfiles_path\": PATH_ZFILES,\n",
" \"jars_path\": PATH_JARS\n",
"}\n",
"\n",
"config = ConfigHandler(reset=True, config_json_text= json.dumps(config_json))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "rLDLTSmcL1he"
},
"source": [
"After putting in all the necessary settings and creating the object, you need to create an instance of the module that manages access to the Database.\n",
"\n",
"> _Após colocar todas as configurações necessárias e criar o objeto, é preciso criar uma instância do módulo que gerencia o acesso ao Banco de Dados._"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:25.849306Z",
"start_time": "2020-09-25T23:09:25.526005Z"
},
"id": "z2PEWt1GL1he"
},
"source": [
"rdbms = RDBMS(config)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "CegN0TH3L1he"
},
"source": [
"To exemplify step by step how the NaLIR modules interact with the query, we will use the query ''return me the authors who have a paper in VLDB conference before 2002 after 1995.''.\n",
"\n",
"> _Para exemplificar passo a passo como os módulos do NaLIR interagem com a consulta, utilizaremos a consulta ''return me the authors who have a paper in VLDB conference before 2002 after 1995.''._"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:25.855589Z",
"start_time": "2020-09-25T23:09:25.852274Z"
},
"id": "oXxhw2AkL1hf"
},
"source": [
"query_line='return me the authors who have a paper in VLDB conference before 2002 after 1995.'\n",
"query = Query(query_line,rdbms.schema_graph)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZyPNPITpL1hz"
},
"source": [
"## Stanford Dependency Parser"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "G5YApKuKL1hz"
},
"source": [
"In the initialization of the `StanfordParser` class we perform 3 steps to transform the query into a dependency tree. Each step is represented by a function in the following code:\n",
"\n",
"> _Na inicialização da classe `StanfordParser` executamos 3 passos para transformar a consulta em uma árvore de dependência. Cada passo é representado por uma função no código a seguir:_\n",
"\n",
"``` python\n",
" def __init__(self,query,config):\n",
" \n",
" #....\n",
" \n",
" self.parse(query)\n",
" self.build_tree(query)\n",
" self.fix_conj(query)\n",
" \n",
" #....\n",
"```\n",
"\n",
"The `parse(query)` function runs the Dependency Parser on the incoming query, while the other `build_tree` and `fix_conj` are responsible for adapting the structure to a `ParseTree` structure defined within the package. Below we show how the `build_tree` function builds the Dependency Tree from the dependencies extracted by the Stanford Parser.\n",
"\n",
"\n",
"> _A função `parse(query)` executa o Parser de Dependência na consulta recebida, enquanto as outras `build_tree` e `fix_conj` são responsáveis por adaptar a estrutura para uma estrutura `ParseTree` definida dentro do pacote. A seguir mostramos como a função `build_tree` constrói a Árvore de Dependência a partir das dependências extraídas pelo Parser de Stanford._\n",
"\n",
"``` python\n",
" def build_tree(self,query):\n",
" query.parse_tree = ParseTree()\n",
" done_list = [False] * len(query.tree_table)\n",
" i = 0\n",
"\n",
" for tree_table_item in query.tree_table:\n",
" if tree_table_item[PARENT_IDX] == 0:\n",
" done_list[i] = True\n",
" query.parse_tree.build_node(tree_table_item)\n",
" i+=1\n",
"\n",
" finished = False\n",
" while not finished:\n",
" i = 0\n",
" for i in range(len(query.tree_table)):\n",
" if not done_list[i]:\n",
" if query.parse_tree.build_node(query.tree_table[i]):\n",
" done_list[i] = True\n",
" break\n",
"\n",
"\n",
" finished = True\n",
" for done_list_item in done_list:\n",
" if not done_list_item:\n",
" finished = False\n",
" break\n",
"```\n",
"\n",
"The `query.tree_table` is created in `parse(query)`. `fix_conj()` fixes conjunction relations (when two words are connected by \"and\" or \"or\"). Below is an example of the module output. \n",
"\n",
"\n",
"> _A tabela `query.tree_table` é criada em `parse(query)`. `fix_conj()` corrige relações de conjunção (quando duas palavras estão conectadas por \"e\" ou \"ou\"). Abaixo temos o exemplo da saída do módulo._ "
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:27.104088Z",
"start_time": "2020-09-25T23:09:25.857005Z"
},
"scrolled": true,
"id": "AkqU_MkiL1h0"
},
"source": [
"StanfordParser(query,config)\n",
"query.parse_tree"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "FYOGJ1wOL1h1"
},
"source": [
"**Note**: Graph vizualitation has the program [Graphviz](https://graphviz.org/) as a dependency, so it needs to be installed. Check the website to see how it can be downloaded to your system\n",
"\n",
"\n",
"> _O graph vizualitation possui o programa [Graphviz](https://graphviz.org/) como dependêcia, logo ele precisa ser instalado. Verifique no site como é possível baixá-lo para seu sistema_"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:27.135015Z",
"start_time": "2020-09-25T23:09:27.105509Z"
},
"id": "b4tDoutvL1h1"
},
"source": [
"query.parse_tree.show()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "15t0r234f_-E"
},
"source": [
"## Node Mapper"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XOy6Ab56L1h2"
},
"source": [
"As stated in [General Architecture](#arquitetura_geral), the **Query Tree Node Maper** module, which in the code is represented by `NodeMapper`, has the function of mapping the nodes of the dependency tree to the elements of the SQL structure. The node types are: \n",
"\n",
"> _Como dito em [Arquitetura Geral](#arquitetura_geral), o módulo de **Query Tree Node Maper**, que no código é repesentado por `NodeMapper`, possui a função de mapear os nós da árvore de dependência para os elementos da estrutura SQL. Os tipos de nós, são:_ \n",
"\n",
"\n",
"<a id=\"tipo_no\"></a>\n",
"\n",
"| Node type | Equivalent SQL component |\n",
"| :-- | :-- |\n",
"| Select Node (SN) |palavra reservada SQL : SELECT |\n",
"| Operator Node (ON) | operadores ( !=, >,<,=, contains) |\n",
"| Function Node (FN) | Funções de Agregação (AVG) |\n",
"| Name Node (NN) | Nome de Relação ou de Atributo |\n",
"| Value Node (VN) | Valor de um determinado atributo |\n",
"| Quantifier Node (QN) | Palavras como ALL, ANY, EACH |\n",
"| Logic Node (LN) | Operadores Booleanos (AND, OR, NOT) |\n",
"\n",
"\n",
"Excluding the **Name** and **Value** nodes, all others are defined from an xml file. This file is in the `zfiles` directory, named `tokens.xml` and has the following format: \n",
"\n",
"> _Excluindo os Nós **Name** e **Value** nodes, todos os outros são definidos a partir de um arquivo xml. Esse arquivo se encontra dentro do diretório `zfiles`, como o nome de `tokens.xml` e possui o seguinte formato:_ \n",
"\n",
"``` xml\n",
"<types>\n",
" <!-- Command Token, verb -->\n",
" <CMT_V>\n",
" <phrase>\n",
" tell\n",
" <example>Tell me all the books published in year 1993.</example>\n",
" </phrase>\n",
" <phrase>\n",
" give\n",
" <example>Give me all the books published in year 1993.</example>\n",
" </phrase>\n",
" <phrase>\n",
" return\n",
" <example>Return all the books published in year 1993.</example>\n",
" </phrase>\n",
" </CMT_V>\n",
" <!-- Function Token, adjective -->\n",
" <FT>\n",
" <phrase>\n",
" minimum\n",
" <function>min</function>\n",
" <example>Find the books with the minimum price.</example>\n",
" </phrase>\n",
" <phrase>\n",
" most\n",
" <function>max</function>\n",
" </phrase>\n",
" <phrase>\n",
" biggest\n",
" <function>max</function>\n",
" </phrase>\n",
" <phrase>\n",
" least\n",
" <function>min</function>\n",
" </phrase>\n",
" </FT>\n",
" <!-- Operator Token, adj -->\n",
" <OT>\n",
" <phrase>\n",
" earlier\n",
" <operator>&lt;</operator>\n",
" <example>Find all the books of Ayn Rand, where the year of each book is earlier than the year of \"Fountain Head\".</example>\n",
" </phrase>\n",
" <phrase>\n",
" later\n",
" <operator>&gt;</operator>\n",
" <example>Find all the books of Ayn Rand, where the year of each book is later than the year of \"Fountain Head\".</example>\n",
" </phrase>\n",
" </OT>\n",
"</types>\n",
"```\n",
"\n",
"So if whoever is using the System wants to increase the granularity and interpretation of NaLIR, they can simply increase the list of tokens in this file. The following code shows the mapping function steps.\n",
"\n",
"> _Assim, se quem estiver utilizando o Sistema quiser aumentar a granularidade e a interpretação do NaLIR, basta aumentar a lista de tokens deste arquivo. O código a seguir mostra as etapas de função do mapeamento._\n",
"\n",
"``` python \n",
" \n",
" def phrase_process(query, db, config):\n",
" # ...\n",
" \n",
" NodeMapper.tokenizer(query, tokens)\n",
" NodeMapper.delete_useless(query)\n",
" NodeMapper.map(query,db)\n",
" NodeMapper.individual_rank(query)\n",
" group_ranking = NodeMapper.group_ranking(query,db)\n",
" return NodeMapper.group_ranking_to_df(group_ranking)\n",
"\n",
"\n",
"```\n",
"\n",
"The first step NodeMapper.tokenizer(query, tokens) uses the tokens loaded from the tokens.xml file and correctly maps the nodes that perform a different function than Name Nodes and Value Nodes. In this function the differentiation happens based on the syntactic function of the word and the presence of the word in the tokens file. For example, if the word is not present in tokens, and is a proper or common noun, adjective or a cardinal number, this step marks an intermediate type for the node, NTVT. This indicates to the following stages that these nodes, despite not having a final type, may be candidates for some mapping (Name Node or Value Node).\n",
"\n",
"In delete_useless all nodes that have no type (NA) are deleted. On the other hand, map, which will be shown next, searches for all schema elements and database values that can refer to \"candidate\" Name and Value Nodes within the tree.\n",
"\n",
"\n",
"\n",
"\n",
"> _A primeira etapa `NodeMapper.tokenizer(query, tokens)` utiliza os tokens carregados do arquivo `tokens.xml` e mapeia corretamente os nós que exercem uma função diferente de Name Nodes e Value Nodes. Nessa função a diferenciação acontece baseado na função sintática da palavra e na presença da palavra no arquivo de tokens. Por exemplo, se a palavra não estiver presente em tokens, e for um substantivo próprio ou comum, adjetivo ou um número cardinal, essa etapa assinala um tipo intermediario para o nó, `NTVT`. Isso indica para as fases seguintes que esse nós, apesar de não terem um tipo final, podem ser candidatos a algum mapeamento (Name Node ou Value Node)._\n",
"\n",
"> _Em `delete_useless` todos os nós que não possuem tipo (`NA`) são deletados. Já em `map`, que será mostrado a seguir, busca todos os elementos de esquema e valores de banco de dados que podem fazer referencia aos nós \"candidatos\" a Name e Value Nodes dentro da árvore._ \n",
"\n",
"\n",
"``` python\n",
" def map(query, db):\n",
" parse_tree = query.parse_tree\n",
" all_nodes = parse_tree.all_nodes\n",
"\n",
" for i in range(len(all_nodes)):\n",
" tree_node = all_nodes[i]\n",
" if tree_node.token_type == 'NTVT' or tree_node.token_type == 'JJ':\n",
" db.is_schema_exist(tree_node)\n",
" db.is_text_exist(tree_node)\n",
"\n",
" if len(tree_node.mapped_elements) == 0:\n",
" tree_node.token_type = 'NA'\n",
"\n",
" elif tree_node.token_type == 'VT':\n",
" OT = '='\n",
" if tree_node.parent.token_type == 'OT':\n",
" OT = tree_node.parent.function\n",
" elif len(tree_node.children) == 1 and tree_node.children[0].token_type == 'OT':\n",
" OT = tree_node.children[0].function\n",
" db.is_num_exist(OT, tree_node)\n",
" tree_node.token_type = 'VTNUM'\n",
"\n",
"```\n",
"\n",
"Candidates for Name Nodes are retrieved in `db.is_schema_exist` and for Value Nodes in `db.is_text_exist` and `db.is_num_exist`. \n",
"\n",
"The `individual_ranking` function is responsible for calculating the similarity between a given node and the schema elements mapped to it, while the `group_ranking` function is responsible for identifying the best mapping configuration for all nodes. This choice is made based on:\n",
"\n",
" - On how the nodes are arranged in the tree\n",
" - The relationship between the schema elements mapped to the nodes\n",
" \n",
" This relationship is calculated in the Schema Graph, using Djkistra to calculate the distance between schema elements. At the end of this process, `group_ranking` classifies the candidate `NTVT` nodes into Name Nodes (`NT`) or Value Nodes (`VT`). \n",
"\n",
"Below is an example of the output from the function that performs the node mapping.\n",
"\n",
"\n",
"> _Os candidatos para Name Nodes são recuperados em `db.is_schema_exist` e para Value Nodes em `db.is_text_exist` e `db.is_num_exist`._ \n",
">\n",
"> _A função `individual_ranking` é responsável por calcular a similaridade entre um dado nó e os elementos de esquema mapeados para o mesmo, enquanto a função ``group_ranking`` é responsável por identificar a melhor configuração de mapeamento para todos nós. Essa escolha é feita baseada em:_\n",
">\n",
"> - _Em como os nós estão dispostos na árvore_\n",
"> - _Na relação entre os elementos de esquema mapeados para os nós_\n",
">\n",
"> _Essa relação é calculada no Grafo de Esquema, usando Djkistra para calcular a distãncia entre os elementos do esquema. Ao final desse processo, `group_ranking` classifica os nós candidatos `NTVT` em Name Nodes (`NT`) ou Value Nodes (`VT`)._ \n",
">\n",
"> _Abaixo segue o exemplo da saida da função que executa o mapeamento dos nós._"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:38.288137Z",
"start_time": "2020-09-25T23:09:27.137697Z"
},
"id": "ZyAj-LnLL1h2"
},
"source": [
"NodeMapper.phrase_process(query,rdbms,config)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:38.327908Z",
"start_time": "2020-09-25T23:09:38.291260Z"
},
"id": "pQMbmg89L1h3"
},
"source": [
"query.parse_tree.show()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "EqNBdGk2L1h3"
},
"source": [
"## Entity Resolution"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hcPpxVeXL1h3"
},
"source": [
"\n",
"This module is the first phase of the Tree Strucutre Adjustor. Before generating trees following the grammar, Entity Resolution checks which nodes reference the same attribute in the tree. This relationship is used both in generating derived trees and in scoring the trees. Below is the code for the entity_resolute function, followed by its output:\n",
"\n",
"\n",
"> _Esse módulo é a primeira fase do **Tree Strucutre Adjustor**. Antes de gerar as árvores seguindo a gramática, **Entity Resolution** verifica quais nós referenciam o mesmo atributo na árvore. Essa relação é utilizada tanto na geração de árvores derivadas, quanto na pontuação das árvores. Segue abaixo o código da função ``entity_resolute``, seguido da saída da mesma:_ \n",
"\n",
"\n",
"``` python \n",
"def entity_resolute(query):\n",
" query.entities = []\n",
" nodes = query.parse_tree.all_nodes\n",
" for i in range(len(nodes)):\n",
" left = nodes[i]\n",
" if left.get_choice_map() == None:\n",
" continue\n",
" left_map = left.get_choice_map().schema_element\n",
" for j in range(i+1, len(nodes)):\n",
" right = nodes[j]\n",
" if right.get_choice_map() == None:\n",
" continue\n",
" right_map = right.get_choice_map().schema_element\n",
" if left_map == right_map:\n",
" if left.token_type == \"VTTEXT\" and left.token_type == \"VTTEXT\":\n",
" if left.label == right.label:\n",
" entity_pair = EntityPair(left, right)\n",
" query.entities.append(entity_pair)\n",
" else:\n",
" continue\n",
" if left.token_type == \"VTTEXT\" and right.token_type == \"NT\" or\\\n",
" left.token_type == \"NT\" and right.token_type == \"VTTEXT\" or\\\n",
" left.token_type == \"NT\" and right.token_type == \"NT\":\n",
" if abs(left.word_order - right.word_order) > 2:\n",
" continue\n",
" else:\n",
" entity_pair = EntityPair(left, right)\n",
" query.entities.append(entity_pair)\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:38.419847Z",
"start_time": "2020-09-25T23:09:38.329289Z"
},
"id": "XToqk2jrL1h4"
},
"source": [
"entity_resolute(query)\n",
"query.entities"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "HkAPPKoCfnQN"
},
"source": [
"## Tree Structure Adjustor"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "t-fb4TXSL1h4"
},
"source": [
"\n",
"This module is responsible for\n",
" * Verifying that the generated tree matches the specified grammar\n",
" * Generating variant trees of the original Parser Tree that follow the grammar\n",
" * Ranking all the Trees, and choosing the one with the highest score for the translation step.\n",
" \n",
"The grammar that should be followed for the trees is as follows:\n",
"\n",
"\n",
"> _Este módulo é responsável por:_\n",
"> * _Verificar se a árvore gerada condiz com a gramática especificada_\n",
"> * _Gerar Árvores variantes da Árvore de Parser original que seguem a gramática_\n",
"> * _Ranquear todas as Árvores, e escolher a com maior pontuação para a etapa de tradução._\n",
"> \n",
"> _A gramática que deve ser seguida para as árvores é a seguinte:_\n",
"\n",
"```\n",
"Q -> (SClause)(ComplexCondition)*\n",
"SClause -> SELECT + GNT\n",
"ComplexCondition -> ON + (leftSubtree*rightSubtree)\n",
"leftSubtree -> GNP\n",
"rightSubtree -> GNP |VN | MIN | MAX\n",
"GNP -> (FN + GNP) | NP\n",
"NP -> NN + (NN)*(Condition)*\n",
"Condition-> VN | (ON + VN)\n",
"```\n",
"\n",
"Where: \n",
"* VN, NN, FN, ON are node types referenced in [node type table](#type_no)\n",
"* `+` represents parent-child relationships between nodes\n",
"* `*` represents sibling relationships between nodes\n",
"\n",
"The `tree_structure_adjust` function, listed below, is responsible for performing the tree generation process.\n",
"\n",
"\n",
"> _Em que:_ \n",
">* _VN, NN, FN, ON são tipos de nós referenciados na [tabela de tipos de nós](#tipo_no)_\n",
">* _`+` representa relações de pai-filho entre os nós_\n",
">* _`*` representa relações de irmão entre os nós_\n",
">\n",
"> _A função `tree_structure_adjust`, listada a seguir, é a responsável por executar o processo de geração das árvores._\n",
"\n",
"``` python\n",
" \tdef tree_structure_adjust(query, db):\n",
"\t\tquery.adjusting_trees = []\n",
"\t\tquery.invalid = []\n",
"\t\tpre_trees = {}\n",
" \n",
"\t\tTreeStructureAdjustor.adjust(query, db,False, pre_trees)\n",
"\t\tif len(query.adjusting_trees) == 0 or (len(query.adjusting_trees) > 0 and query.adjusting_trees[0].cost > 3):\n",
"\t\t\tmax_min = False\n",
"\t\t\tfor node in query.parse_tree.all_nodes:\n",
"\t\t\t\tif node.function == 'max' or node.function == 'min':\n",
"\t\t\t\t\tmax_min = True\n",
"\t\t\t\t\tbreak\n",
"\n",
"\t\t\tif max_min:\n",
"\t\t\t\tTreeStructureAdjustor.adjust(query, db, True, pre_trees)\n",
"\n",
"\t\tadjusted_trees = query.adjusting_trees\n",
"\t\tadjusted_trees.sort(key=lambda elem : ((elem.weight * 100) - elem.cost) * -1 )\n",
"\t\tfor i in range(len(adjusted_trees)):\n",
"\t\t\tadjusted_tree = adjusted_trees[i]\n",
"\n",
"\t\t\tfor j in range(len(adjusted_tree.all_nodes)):\n",
"\t\t\t\tnode = adjusted_trees[i].all_nodes[j]\n",
"\t\t\t\tnode.children.sort(key=lambda elem: elem.node_id)\n",
"\t\t\thash(adjusted_tree)\n",
"\n",
"\n",
"\t\tlinked_list = []\n",
"\t\ti = 0\n",
"\t\twhile i < len(adjusted_trees):\n",
"\t\t\tif adjusted_trees[i].hash_num in linked_list:\n",
"\t\t\t\tadjusted_trees.pop(i)\n",
"\t\t\t\ti-=1\n",
"\t\t\telse:\n",
"\t\t\t\tlinked_list += [adjusted_trees[i].hash_num]\n",
"\t\t\ti+=1\n",
"\n",
"\t\tTreeStructureAdjustor.build_adjusted_trees(query)\n",
"```\n",
"\n",
"\n",
"As we can see, we first generate a series of trees, called `adjusting_trees`. After sorting these intermediate trees, we sort all the children of all the nodes, and remove the repeated trees from the list. From there, we generate the final list of trees, in the `build_adjusted_trees` method. Next, we check the final output of the method.\n",
"\n",
"\n",
"> _Como podemos observar, primeiramente é gerado uma série de árvores, chamadas `adjusting_trees`. Depois de ordenarmos estas árvores intermediárias, ordenamos todos os filhos de todos os nós, e retiramos as árvores repetidas da lista. A partir dai, geramos a lista final de árvores, no método `build_adjusted_trees`. A seguir, verificamos a saída final do método._"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:39.566174Z",
"start_time": "2020-09-25T23:09:38.422232Z"
},
"id": "If8BqoNIL1h4"
},
"source": [
"TreeStructureAdjustor.tree_structure_adjust(query,rdbms)\n",
"query.query_tree.show()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "s2oGHNgFfRBd"
},
"source": [
"## Explainer"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pzMm6EOuL1h5"
},
"source": [
"\n",
"This module, is the last step of the **Tree Strucuture Adjustor**. After selecting the highest scoring tree, this module inserts nodes in order to facilitate the translation of the natural language query into SQL. Another goal of this step is to remove ellipses from the query. The code asseguir, shows that this function just iterates over the nodes in the tree and checks if it can insert other nodes that are already in the tree, into one of its child nodes.\n",
"\n",
"> _Este módulo, é a última etapa de **Tree Strucuture Adjustor**. Após selecionar a árvore com maior pontuação, este módulo insere nós com o intuito de facilitar a tradução da consulta para linguagem natural para a SQL. Outro objetivo desta etapa é retirar elipses da consulta. O código asseguir, mostra que esta função apenas itera sobre os nós da árvore e verifica se pode inserir outros nós que já estão na árvore, em um dos seus nós filhos._\n",
"\n",
"```python \n",
" def explain(query):\n",
" for i in query.adjusted_trees:\n",
" nl = explain_tree(i)\n",
" query.nl_sentence.append(nl)\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"id": "A-Tyu5WxL1h5"
},
"source": [
"explain(query)\n",
"query.query_tree.show()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BOZ8tW7QfHWX"
},
"source": [
"## SQL Translator"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3HtuPqtfL1h5"
},
"source": [
"\n",
"The last module of the NaLIR architecture is responsible for taking the query tree (`query_tree`) generated by the **Query Tree Strucutre Adjustor** and returning the corresponding SQL. The `translate` function, listed below, is responsible for performing this process inside the module. \n",
"\n",
"As we see in the code the structure used to convert the tree into an SQL string is a structure called `block`. Each `block` identifies one query level. Thus, queries that have no subqueries, have only one block. Queries that have aggregation, or nested queries, have two or more blocks.\n",
"\n",
"> _O último módulo da arquitetura do NaLIR, é responsável por receber a Árvore Consulta (`query_tree`), gerada pelo **Query Tree Strucutre Adjustor** e retornar a SQL correspondente. A função `translate`, listada a seguir, é a responsável por executar esse processo dentro do módulo._ \n",
"\n",
"> _Como vemos no código a estrutura utilizada para converter a árvore em uma string SQL é uma estrutura chamada `block`. Cada `block` identifica um nivel de consulta. Assim, consultas que não possuem subconsultas, possuem apenas um bloco. Consultas que possuem agregação, ou consultas aninhadas, possuem dois ou mais blocos._\n",
"\n",
"```python\n",
"def translate(query, db):\n",
" pre_structure_adjustor(query)\n",
" if len(query.query_tree.all_nodes) < 2:\n",
" return\n",
" query.blocks = []\n",
" block_split(query)\n",
" query.blocks[0].node_edge_gen(query.main_block, query.query_tree, query.graph)\n",
" query.blocks[0].translate(query.main_block, query.query_tree)\n",
" query.translated_sql = query.blocks[0].sql\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-25T23:09:39.725094Z",
"start_time": "2020-09-25T23:09:39.638591Z"
},
"id": "5ooOlrvEL1h6"
},
"source": [
"translate(query, rdbms)\n",
"print(query.translated_sql)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "1CPPLkZ8L1h6"
},
"source": [
"## Examples (~2 mins)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "2TgOhCQYL1h6"
},
"source": [
"queries = [\n",
" 'return me the homepage of PVLDB.',\n",
" 'return me the homepage of \"H. V. Jagadish\".',\n",
" 'return me the abstract of \"Making database systems usable\".',\n",
" 'return me the year of \"Making database systems usable\"',\n",
" 'return me all the keywords.',\n",
" 'return me the number of the organizations in \"North America\".',\n",
" 'return me the number of keywords in VLDB conference.',\n",
" 'return me the number of the organizations.',\n",
" 'return me the number of authors of \"Making database systems usable',\n",
" 'return me the area of the VLDB conference.',\n",
" 'return me the organization \"H. V. Jagadish\" is in.',\n",
" 'return me all the organizations in Databases area.'\n",
"]\n",
"\n",
"sql_queries = []\n",
"for query in queries:\n",
" sql_query = run_query(query, rdbms, config)\n",
" sql_queries += [sql_query]\n",
"\n",
"sql_queries"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "DMwdbQ1aL1h7"
},
"source": [
" ## Try it yourself!"
]
},
{
"cell_type": "code",
"metadata": {
"id": "-SkLuTBPL1h7"
},
"source": [
"query = 'return me all the organizations in Databases area.' # ponha sua consulta aqui\n",
"run_query(query, rdbms, config)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "o_gbZv2jL1h7"
},
"source": [
""
],
"execution_count": null,
"outputs": []
}
]
}