Querying Riak Search 2.0

This is still a WIP.

Advanced MapReduce examples using Riak Search 2.0 instead...

Prerequisites

Enable Search

While the node is stopped, modify the riak.conf file. Add the line

search = on

to the end of the file, or find and replace the value within the file. Once search is enabled on all of the nodes in your cluster, start them up. If the nodes fail to start with search enabled, verify that a Java Virtual Machine is installed and that the java command works as expected.
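On a single package-installed node, the whole step might look like the following sketch (the riak.conf path assumes a default package install; adjust it for your environment):

riak stop
# append the setting, or edit the existing "search" line in place
echo "search = on" >> /etc/riak/riak.conf
riak start
# Search embeds Solr, so a working JVM is required
java -version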

Create a Schema for the Data

Since the load script does not use column names that are compatible with the default schema, and since using the default schema is not recommended for a production system anyway, we are going to create a custom schema suited to the goog.csv data.

Start with the Skeleton Schema

Start with the skeleton schema documented in the Search Schema documentation. Create a file in your working folder named goog.xml and insert the following content:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="goog" version="1.5">
 <fields>

   <!-- All of these fields are required by Riak Search -->
   <field name="_yz_id"   type="_yz_str" indexed="true" stored="true"  multiValued="false" required="true"/>
   <field name="_yz_ed"   type="_yz_str" indexed="true" stored="false" multiValued="false"/>
   <field name="_yz_pn"   type="_yz_str" indexed="true" stored="false" multiValued="false"/>
   <field name="_yz_fpn"  type="_yz_str" indexed="true" stored="false" multiValued="false"/>
   <field name="_yz_vtag" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
   <field name="_yz_rk"   type="_yz_str" indexed="true" stored="true"  multiValued="false"/>
   <field name="_yz_rt"   type="_yz_str" indexed="true" stored="true"  multiValued="false"/>
   <field name="_yz_rb"   type="_yz_str" indexed="true" stored="true"  multiValued="false"/>
   <field name="_yz_err"  type="_yz_str" indexed="true" stored="false" multiValued="false"/>
 </fields>

 <uniqueKey>_yz_id</uniqueKey>

 <types>
    <!-- YZ String: Used for non-analyzed fields -->
    <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true" />
 </types>
</schema>

Add FieldTypes to the Schema

We need to configure some additional Solr FieldTypes for our fields. Inside the <types> element, add the following type definitions:

    <fieldType name="date" class="solr.TrieDateField" />
    <fieldType name="integer" class="solr.TrieLongField" />
    <fieldType name="float" class="solr.TrieFloatField" />

    <!-- Catch-all Field Type -->
    <fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

These types will be used in our field definitions that will be made in the next step.
The "catch-all" field type will be used on a Dynamic Field. Fields that are in the object, but not in the schema, will match the catch-all field definition and be ignored rather than throwing an error.

Adding Fields to the Schema

Next, we will define the Solr Fields. These field definitions tell Search what to do with incoming data: whether to index it, store it, or ignore it.

Inside the <fields> element, add the following field definitions:

   <field name="date"     type="date"    indexed="true" stored="false" multiValued="false" />
   <field name="open"     type="float"   indexed="true" stored="false" multiValued="false" />
   <field name="high"     type="float"   indexed="true" stored="false" multiValued="false" />
   <field name="low"      type="float"   indexed="true" stored="false" multiValued="false" />
   <field name="close"    type="float"   indexed="true" stored="false" multiValued="false" />
   <field name="volume"   type="integer" indexed="true" stored="false" multiValued="false" />

   <!-- Catch-all field -->
   <dynamicField name="*" type="ignored"  />

Configure Riak

Set up the shell environment

This just makes the following commands easier to copy and paste. Run these in your shell, substituting the correct values where necessary.

export RIAK_HOST="http://localhost"
export RIAK_PORT=8098
export SOLR_PORT=8093
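As a quick sanity check that these values point at a live node, Riak's HTTP ping endpoint should answer OK:

curl "$RIAK_HOST:$RIAK_PORT/ping"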

Load the Schema

curl -XPUT "$RIAK_HOST:$RIAK_PORT/search/schema/goog" \
 -H 'Content-Type:application/xml' --data-binary @goog.xml
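To verify the upload, fetch the schema back; it should echo the XML you just loaded:

curl "$RIAK_HOST:$RIAK_PORT/search/schema/goog"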

Create the Search Index

curl -XPUT "$RIAK_HOST:$RIAK_PORT/search/index/goog" \
 -H 'Content-Type:application/json' -d '{"schema":"goog"}'
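Index creation can take a moment to propagate across the cluster. Fetching the index back should return its name and schema once it is ready:

curl "$RIAK_HOST:$RIAK_PORT/search/index/goog"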

Add the Index to the goog Bucket

curl -XPUT "$RIAK_HOST:$RIAK_PORT/buckets/goog/props" \
 -H 'Content-Type:application/json' -d '{"props":{"search_index":"goog"}}'
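To confirm the association, fetch the bucket properties and check the search_index value:

curl "$RIAK_HOST:$RIAK_PORT/buckets/goog/props"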

Modify the Load Script

There are a few fixes that will need to be made to the original load_data.erl file. First, if you do not have Erlang installed on your machine, you will need to change the path in the first line of the file to point to the embedded Erlang runtime included with Riak.

Change

#!/usr/bin/env escript

to

#!/usr/lib64/riak/erts-5.10.3/bin/escript

if you are on CentOS or RHEL, or

#!/usr/lib/riak/erts-5.10.3/bin/escript

for Ubuntu, Debian, and FreeBSD. (The erts version in the path may differ depending on your Riak release.)

Next, we need to make some changes to the curl command that the load script calls. We need to change the URL so that it reflects our Riak IP address and port, and emit ISO 8601-style dates (the format Solr date fields expect) so that Riak Search can parse them. Change line 9 from

    JSON = io_lib:format("{\"Date\":\"~s\",\"Open\":~s,\"High\":~s,\"Low\":~s,\"Close\":~s,\"Volume\":~s,\"Adj. Close\":~s}", Line),

to

    JSON = io_lib:format("{\"date\":\"~sT00:00:00Z\",\"open\":~s,\"high\":~s,\"low\":~s,\"close\":~s,\"volume\":~s,\"adj_close\":~s}", Line),

Modify line 10, correcting the IP address and port if necessary (for example, if you are using a devrel):

    Command = io_lib:format("curl -X PUT http://127.0.0.1:8091/riak/goog/~s -d '~s' -H 'content-type: application/json'", [hd(Line),JSON]),

Load the Data

./load_data.erl goog.csv
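Once the script finishes, a match-all query with rows=0 is a cheap way to confirm that documents were indexed; numFound should equal the number of rows loaded:

curl "$RIAK_HOST:$RIAK_PORT/search/query/goog?q=*:*&rows=0&wt=json"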

Running Example Queries

In this section, we will run some of the sample queries that were provided in the "Advanced MapReduce - Bigger Data Examples" documentation against the sample dataset.

Select days where the high was over $600

http://10.0.1.19:8098/search/query/goog?wt=json&q=high:[600%20TO%20*]

or

curl -g "$RIAK_HOST:$RIAK_PORT/search/query/goog?wt=json&q=high:[600%20TO%20*]" | jsonpp

Select days where there was a loss

The filter here is a Solr function range query: {!frange u=0}sub(close,open) keeps only documents where close minus open is at most zero (it appears URL-encoded in the curl form).

curl "$RIAK_HOST:$RIAK_PORT/search/query/goog?q=*:*&fq=%7B%21frange+u%3D0%7Dsub%28close%2Copen%29&wt=json" | jsonpp

or

http://10.0.1.19:8098/search/query/goog?q=*:*&fq={!frange%20u=0}sub(close,open)&wt=json
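Because the query endpoint passes standard Solr parameters through, sorting and paging also work. As a sketch (assuming the index was built with the schema above), the five days with the highest highs:

curl "$RIAK_HOST:$RIAK_PORT/search/query/goog?wt=json&q=*:*&sort=high%20desc&rows=5" | jsonpp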

Deleting an Index

Before deleting an index, note that there is no way to rebuild one in real time; you would have to clear the AAE trees and wait for Active Anti-Entropy to repair them. Detach the index from the bucket first, then delete it:

curl -XPUT "$RIAK_HOST:$RIAK_PORT/buckets/goog/props" -H'content-type:application/json' -d'{"props":{"search_index":"_dont_index_"}}'
curl -XDELETE "$RIAK_HOST:$RIAK_PORT/search/index/goog"