@reznikmm
Last active February 21, 2023 09:02

Introduction to VSS library

The VSS library is designed to provide advanced string and text processing capabilities. The concept behind the new string library is to offer a convenient and robust API that allows developers to work with Unicode text, regardless of its internal representation. In this article, we will introduce you to the library and explain its purpose, highlighting its usefulness for developers working in this area.

What is the rationale behind creating another library for string processing?

Although Ada offers several standard string types, and there are several libraries developed by the Ada community, each one has its own drawbacks or limitations.

The String, Wide_String, and Wide_Wide_String types are indefinite, which can be inconvenient when storing string values in an object or container. The Unbounded_String, Unbounded_Wide_String, and Unbounded_Wide_Wide_String types are definite, but their set of provided operations is limited, and dot notation is not available for them.
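
The difference can be seen with the standard library alone. The sketch below is illustrative only (the procedure and variable names are made up for this example): a String object keeps the length it was created with, while an Unbounded_String can grow but has to be manipulated through ordinary subprogram calls and explicit conversions.

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Standard_Strings_Demo is
   --  An indefinite String takes its bounds from the initial value;
   --  they cannot change afterwards.
   Greeting : String := "hello";

   --  Unbounded_String is definite: no initial value is required and
   --  the object can grow later.
   Name : Unbounded_String;
begin
   Greeting (1) := 'H';  --  elements may change, the length may not
   --  Greeting := "Hello, world";  --  would raise Constraint_Error

   --  Unbounded_String is not visibly tagged, so calls are written as
   --  Append (Name, ...) rather than Name.Append (...).
   Name := To_Unbounded_String (Greeting);
   Append (Name, ", world");

   Ada.Text_IO.Put_Line (To_String (Name));
end Standard_Strings_Demo;
```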

Furthermore, each type is restricted to a specific character set, necessitating the conversion of the character set when reading, writing, or interacting with external sources. String and Unbounded_String types only support Latin-1, while wide types use 2 or 4 bytes per character, even for ASCII.

Unfortunately, the most commonly used encoding, UTF-8, is not natively supported by any of these types. The UTF8_String type attempts to fill this gap, but it breaks the user's expectation that each element is a character and places the burden and complexity of working with the encoding on the user.
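
A small standard-library sketch makes the mismatch visible. It encodes a Wide_Wide_String with Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode and prints both lengths; the procedure name is hypothetical, and the source file is assumed to be saved as UTF-8. The two numbers differ because every element of a UTF_8_String is a code unit, not a character.

```ada
pragma Wide_Character_Encoding (UTF8);

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure UTF8_Length_Demo is
   --  One element per Unicode character.
   Text : constant Wide_Wide_String := "𝛼−𝛽";

   --  One element per UTF-8 code unit (byte).
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Text);
begin
   Ada.Text_IO.Put_Line
     ("Characters:" & Natural'Image (Text'Length));
   Ada.Text_IO.Put_Line
     ("UTF-8 code units:" & Natural'Image (Encoded'Length));
end UTF8_Length_Demo;
```

Indexing or slicing Encoded by position can therefore cut a character in the middle, which is exactly the burden the paragraph above describes.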

In modern times, text is not merely a sequence of characters but rather consists of grapheme clusters, words, and lines, as defined by the Unicode standard. As a result, tasks such as comparing and sorting strings (collation) and case conversion cannot be performed solely at the level of individual characters. For example, To_Upper ("ß") = "SS": the result is longer than the input, so no character-by-character mapping can produce it. The standard library does not support this.
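
To illustrate the limitation with the standard library only, here is a minimal sketch (the procedure name is hypothetical). Ada.Characters.Handling.To_Upper maps each character independently, and "ß" has no upper-case form in Latin-1, so it is returned unchanged and the result always has the same length as the input. The character is built with Character'Val so that the example does not depend on the source file encoding.

```ada
with Ada.Characters.Handling;
with Ada.Text_IO;

procedure Sharp_S_Demo is
   --  LATIN SMALL LETTER SHARP S ("ß") at its Latin-1 position.
   Sharp_S : constant Character := Character'Val (16#DF#);

   Source : constant String := "stra" & Sharp_S & "e";
   Upper  : constant String := Ada.Characters.Handling.To_Upper (Source);
begin
   --  A per-character mapping can never change the length.
   Ada.Text_IO.Put_Line
     ("Same length: " & Boolean'Image (Upper'Length = Source'Length));

   --  The sharp s is still present in the converted string.
   Ada.Text_IO.Put_Line
     ("Sharp s kept: "
      & Boolean'Image (Upper (Upper'First + 4) = Sharp_S));
end Sharp_S_Demo;
```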

To overcome these issues, the VSS library:

  • provides a definite type to represent a Unicode character string with a convenient set of operations, along with a dedicated string vector type with an efficient implementation.
  • provides an encoding-agnostic API that allows efficient implementations tailored to the platform or application.
  • offers a comprehensive range of string and string vector operations, comparable to those found in other programming languages.
  • takes advantage of more modern language features and technologies, offering improved performance, memory usage, or other benefits.

Getting Started

The library can be found on GitHub and is distributed under the Apache 2.0 license. It can be built using an Ada 2022 compliant compiler. Additionally, it is possible to use Alire to build the library.

git clone https://github.com/AdaCore/VSS.git
cd VSS
alr build

The VSS library is divided into multiple projects:

  • vss_text.gpr - base string library with
    • Unicode string, string vector, byte vector types
    • input/output text streams to read/write files, memory and stdin/stdout
    • iterators for characters, grapheme clusters, words and lines
    • encoders and decoders for several of the most popular text encodings
  • vss_regexp.gpr - a regular expression engine
  • vss_json.gpr - a JSON streaming API that allows for efficient parsing and composing of JSON content on the fly
  • vss_xml.gpr - an XML streaming API implemented over XMLAda or Matreshka libraries
  • vss_xml_templates.gpr - an XML template engine inspired by Zope Page Templates

How about giving the VSS string library a try?

First steps with VSS

We start by creating a sample Alire crate and adding VSS as a dependency:

alr init --bin vss_test
cd vss_test
alr pin vss --use=PATH_TO_VSS_FOLDER
# or you can use a Git repository link:
alr pin vss --use=https://github.com/AdaCore/VSS.git --branch=master

Then we modify vss_test.adb to the following code:

pragma Wide_Character_Encoding (UTF8);

with VSS.Strings;
with VSS.Strings.Conversions;
with Ada.Wide_Wide_Text_IO;

procedure Vss_Test is
   Text : VSS.Strings.Virtual_String := "𝛼−𝛽";
begin
   Ada.Wide_Wide_Text_IO.Put_Line
     (VSS.Strings.Conversions.To_Wide_Wide_String (Text));
end Vss_Test;

The first line tells GNAT that the source file is encoded in UTF-8. Next, we bring in the VSS library units and the Ada.Wide_Wide_Text_IO package. The initialization of Text relies on the Ada 2022 syntax for user-defined literals, which hides a call to VSS.Strings.To_Virtual_String for the string literal. Converting back to a standard string has no such shortcut and requires an explicit call, here To_Wide_Wide_String.

To build and execute this code just run:

alr run
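
To make the user-defined literal mentioned above more concrete, the following sketch writes the same initialization both ways and compares the results. The procedure and variable names are hypothetical; the only VSS call used is VSS.Strings.To_Virtual_String, which the text above names as the function behind the literal.

```ada
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Wide_Text_IO;
with VSS.Strings;

procedure Vss_Literal is
   use type VSS.Strings.Virtual_String;  --  makes "=" directly visible

   --  Ada 2022 user-defined literal: the conversion is implicit.
   Implicit : constant VSS.Strings.Virtual_String := "𝛼−𝛽";

   --  The same conversion, written out explicitly.
   Explicit : constant VSS.Strings.Virtual_String :=
     VSS.Strings.To_Virtual_String ("𝛼−𝛽");
begin
   Ada.Wide_Wide_Text_IO.Put_Line
     ("Equal: " & Boolean'Wide_Wide_Image (Implicit = Explicit));
end Vss_Literal;
```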

With Text in hand, we can:

  • find if it's empty: Text.Is_Empty
  • find text's length in characters: Text.Character_Length
  • find text's hash: Text.Hash
  • check whether it starts (or ends) with another string: Text.Starts_With ("𝛼")
  • change character cases: Text.To_Uppercase
  • etc.
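
The queries listed above can be combined into a small program. This is a sketch rather than reference code: the procedure name is made up, and it assumes that Character_Length returns the type VSS.Strings.Character_Count, which should be checked against the VSS.Strings specification of the version in use.

```ada
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Wide_Text_IO;
with VSS.Strings;
with VSS.Strings.Conversions;

procedure Vss_Queries is
   use Ada.Wide_Wide_Text_IO;

   Text : constant VSS.Strings.Virtual_String := "𝛼−𝛽";
begin
   Put_Line ("Empty: " & Boolean'Wide_Wide_Image (Text.Is_Empty));

   --  Assumption: the result type is VSS.Strings.Character_Count.
   Put_Line
     ("Length:"
      & VSS.Strings.Character_Count'Wide_Wide_Image
          (Text.Character_Length));

   Put_Line
     ("Starts with alpha: "
      & Boolean'Wide_Wide_Image (Text.Starts_With ("𝛼")));

   --  To_Uppercase returns a new string; Text itself is unchanged.
   Put_Line
     (VSS.Strings.Conversions.To_Wide_Wide_String (Text.To_Uppercase));
end Vss_Queries;
```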

We can modify Text by:

  • Appending string or character: Text.Append ('.');
  • Prepending string or character: Text.Prepend (">>>");
  • Erasing: Text.Clear;
  • etc.
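
Put together, the modifying operations above might be used as in the following sketch (the procedure name is hypothetical). Unlike the queries, these are procedures that change the object in place, so Text must be declared as a variable.

```ada
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Wide_Text_IO;
with VSS.Strings;
with VSS.Strings.Conversions;

procedure Vss_Modify is
   use Ada.Wide_Wide_Text_IO;

   Text : VSS.Strings.Virtual_String := "𝛼−𝛽";
begin
   Text.Append ('.');      --  add a single character at the end
   Text.Prepend (">>>");   --  add a string at the beginning
   Put_Line (VSS.Strings.Conversions.To_Wide_Wide_String (Text));

   Text.Clear;             --  erase the whole content
   Put_Line
     ("Empty after Clear: " & Boolean'Wide_Wide_Image (Text.Is_Empty));
end Vss_Modify;
```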

We can split Text into a string vector (defined in VSS.String_Vectors):

--  Requires "with VSS.String_Vectors;" in the context clause.
declare
   List : constant VSS.String_Vectors.Virtual_String_Vector :=
     Text.Split ('−');
begin
   for Item of List loop
      Ada.Wide_Wide_Text_IO.Put_Line
        (VSS.Strings.Conversions.To_Wide_Wide_String (Item));
   end loop;
end;

A dedicated function, Text.Split_Lines, splits the text into a string vector using the specified line separators. Conversely, the vector type offers the Join and Join_Lines functions for the opposite operations.
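
A round trip through Split_Lines and Join_Lines could look like the sketch below. Several details are assumptions that are not stated in this article and should be verified against the VSS specifications: that Split_Lines can be called without arguments (default line terminators), and that Join_Lines takes a terminator value such as VSS.Strings.LF. The procedure and variable names are illustrative.

```ada
with Ada.Wide_Wide_Text_IO;
with VSS.String_Vectors;
with VSS.Strings;
with VSS.Strings.Conversions;

procedure Vss_Lines is
   Line_Feed : constant Wide_Wide_Character := Wide_Wide_Character'Val (10);

   --  Two lines separated by a line feed character.
   Text : constant VSS.Strings.Virtual_String :=
     VSS.Strings.To_Virtual_String ("first" & Line_Feed & "second");

   --  Assumption: the default parameters split on common terminators.
   Lines : constant VSS.String_Vectors.Virtual_String_Vector :=
     Text.Split_Lines;

   --  Assumption: Join_Lines accepts the terminator to put after lines.
   Joined : constant VSS.Strings.Virtual_String :=
     Lines.Join_Lines (VSS.Strings.LF);
begin
   for Item of Lines loop
      Ada.Wide_Wide_Text_IO.Put_Line
        ("[" & VSS.Strings.Conversions.To_Wide_Wide_String (Item) & "]");
   end loop;

   Ada.Wide_Wide_Text_IO.Put
     (VSS.Strings.Conversions.To_Wide_Wide_String (Joined));
end Vss_Lines;
```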

Conclusion

In conclusion, the VSS library provides advanced string and text processing capabilities. It offers an API that allows developers to work with Unicode text, regardless of its internal representation. The library overcomes the limitations of Ada's standard string types and other community-developed string libraries. It provides a definite type to represent a Unicode character string with a comprehensive range of operations comparable to those found in other programming languages. Additionally, it is encoding-agnostic, allowing efficient implementations tailored to the platform or application. The library is divided into multiple projects, with each project catering to a specific need. The VSS library is distributed under the Apache 2.0 license, and it can be built using an Ada 2022 compliant compiler. With its efficient implementation, modern language features and technologies, and support for tasks such as comparing and sorting strings, the VSS library is a useful tool for developers working with strings and text processing.

In the subsequent articles, we will explore more advanced concepts such as cursors, streams, encoders/decoders, and so on. Stay tuned!
