Skip to content

Instantly share code, notes, and snippets.

@jkot
Last active July 14, 2016 18:45
Show Gist options
  • Save jkot/3f9363b9aca5b7d3721e to your computer and use it in GitHub Desktop.
Save jkot/3f9363b9aca5b7d3721e to your computer and use it in GitHub Desktop.
Virtuoso

Virtuoso tips and tricks

Virtuoso settings for retrieving large result sets

[Parameters]
MaxSortedTopRows                = 250000
 
[SPARQL]
ResultSetMaxRows                = 10000000
MaxQueryCostEstimationTime      = 36000    ; in seconds
MaxQueryExecutionTime           = 36000    ; in seconds

Working with constraints DBpedia's SPARQL endpoint MaxSortedTopRows Limits via LIMIT & OFFSET

The DBpedia SPARQL endpoint is configured with the following INI setting:

`MaxSortedTopRows = 40000`

The setting above sets a threshold for sorted rows.

To circumvent the limitation, use subqueries:

To prevent the problem outlined above you can leverage the use of subqueries which make better use of temporary storage associated with this kind of quest. An example would take the form:

SELECT ?p ?s 
WHERE 
  {
    {
      SELECT DISTINCT ?p ?s 
      FROM <http://dbpedia.org> 
      WHERE   
        { 
          ?s ?p <http://dbpedia.org/resource/Germany> 
        } ORDER BY ASC(?p) 
    }
  } 
OFFSET 50000 
LIMIT 1000

Loading

cd virtuosodb/bin
rlwrap ./isql
SQL> ld_dir ('/tmp', 'ids.nt', 'http://jakub');
SQL> select * from load_list;
SQL> rdf_loader_run();

Optimizations

Use SSDs, otherwise it chokes on IO. Use multiple stripes (but have to know the size of the loaded db in advance). Use a lot of RAM.

Get results in a given format

TSV

curl -F "format=text/tab-separated-values" -F "query=@query.sparql" http://localhost:8890/sparql > result.tsv

N-Triples

curl -F "format=text/plain" -F "query=@query.sparql" http://localhost:8890/sparql > result.nt

or -F "format=text/ntriples"

Turtle

-F "format=text/n3"

Dump a graph

This doesn't work: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtRDFDatasetDump cause:

*** Error 22023: [Virtuoso Driver][Virtuoso Server]SR601: Argument 1 of http_ttl_triple() should be an array of special format at line 1 of Top-Level: dump_one_graph ('http://example.org/news', '/example/dump/data_', 1000000000)

Fixing env by replacing it with vector (0, 0, 0) removes the problem. Also http_ttl_triple can be replaced by http_nt_triple to get N-Triples:

CREATE PROCEDURE dump_one_graph 
  ( IN  srcgraph           VARCHAR  , 
    IN  out_file           VARCHAR  , 
    IN  file_length_limit  INTEGER  := 1000000000
  )
  {
    DECLARE  file_name     VARCHAR;
    DECLARE  env, 
             ses           ANY;
    DECLARE  ses_len, 
             max_ses_len, 
             file_len, 
             file_idx      INTEGER;
    SET ISOLATION = 'uncommitted';
    max_ses_len  := 10000000;
    file_len     := 0;
    file_idx     := 1;
    file_name    := sprintf ('%s%06d.ttl', out_file, file_idx);
    string_to_file ( file_name || '.graph', 
                     srcgraph, 
                     -2
                   );
    string_to_file ( file_name, 
                     sprintf ( '# Dump of graph <%s>, as of %s\n@base <> .\n', 
                               srcgraph, 
                               CAST (NOW() AS VARCHAR)
                             ), 
                     -2
                   );
    env := vector (0, 0, 0);
    ses := string_output ();
    FOR (SELECT * FROM ( SPARQL DEFINE input:storage "" 
                         SELECT ?s ?p ?o { GRAPH `iri(?:srcgraph)` { ?s ?p ?o } } 
                       ) AS sub OPTION (LOOP)) DO
      {
        http_nt_triple (env, "s", "p", "o", ses);
        ses_len := length (ses);
        IF (ses_len > max_ses_len)
          {
            file_len := file_len + ses_len;
            IF (file_len > file_length_limit)
              {
                http (' .\n', ses);
                string_to_file (file_name, ses, -1);
		gz_compress_file (file_name, file_name||'.gz');
		file_delete (file_name);
                file_len := 0;
                file_idx := file_idx + 1;
                file_name := sprintf ('%s%06d.ttl', out_file, file_idx);
                string_to_file ( file_name, 
                                 sprintf ( '# Dump of graph <%s>, as of %s (part %d)\n@base <> .\n', 
                                           srcgraph, 
                                           CAST (NOW() AS VARCHAR), 
                                           file_idx), 
                                 -2
                               );
                 env := VECTOR (dict_new (16000), 0, '', '', '', 0, 0, 0, 0, 0);
              }
            ELSE
              string_to_file (file_name, ses, -1);
            ses := string_output ();
          }
      }
    IF (LENGTH (ses))
      {
        http (' .\n', ses);
        string_to_file (file_name, ses, -1);
	gz_compress_file (file_name, file_name||'.gz');
	file_delete (file_name);
      }
  }

Loading DbPedia / Freebase

https://joernhees.de/blog/2014/11/10/setting-up-a-local-dbpedia-2014-mirror-with-virtuoso-7-1-0/

@jindrichmynarz
Copy link

Also the calls to string_to_file outputting @base directive should be removed, because @base is not allowed in N-Triples.

@jindrichmynarz
Copy link

Finally, the trailing dot produced by http (' .\n', ses); needs to be removed to produce valid N-Triples. I've published my version of this procedure to https://gist.github.com/jindrichmynarz/da9b8fe13329ff7d2985 (hopefully, it's about right now).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment