Skip to content

Instantly share code, notes, and snippets.

@angstyloop
Last active June 15, 2023 03:33
Show Gist options
  • Save angstyloop/2281191a3e7fd7e4c615698661fbac24 to your computer and use it in GitHub Desktop.
Save angstyloop/2281191a3e7fd7e4c615698661fbac24 to your computer and use it in GitHub Desktop.
GTK4 demo app: search with mismatches. Search example strings with MAX_MISMATCHES number of mismatched characters (i.e. substitution errors). The search and preprocess functions were modified from the example code from "Fast and Practical Approximate String Matching" by Ricardo A. Baeza-Yates and Chris H. Perleberg.
/* search_with_mismatches.c
*
* COMPILE
*
* gcc `pkg-config --cflags gtk4` -o search_with_mismatches search_with_mismatches.c `pkg-config --libs gtk4`
*
* RUN
*
* search_with_mismatches
*
*/
#include <gtk/gtk.h>
#include <stdio.h>
/* Padding constant
*/
#define PADDING_ZERO_PIXELS 0
/* Maximum number of mismatches we allow.
*/
#define MAX_MISMATCHES 3
/*
* Search for fuzzy matches of a pattern in text with @k or fewer mismatches
* (substitutions). Uses doubly-linked list GList from GLib.
*
* COMPILE
*
* gcc `pkg-config --cflags glib-2.0` -o g_match_with_mismatches g_match_with_mismatches.c `pkg-config --libs glib-2.0`
*
* RUN
*
* ./match_with_mismatches
*
* REFS
*
* The search and preprocess functions were taken from the example code from
* "Fast and Practical Approximate String Matching" by Ricardo A. Baeza-Yates
* and Chris H. Perleberg. I modified the code to make it fit better into
* a GLib application.
*/
/* Size of alphabet index array. This number should be twice the size
* of the alphabet, which is 128 in this case for ASCII (not extended).
* The purpose of this extra space is explained later.
*/
#define ALPHABET_SIZE 256
/* The Match object will hold the index @i of the first character of the match
* and the number of mismatched characters @k within that match. The Match
* objects are also linked list nodes.
*/
typedef struct Match Match;
struct Match {
int i;
int k;
gchar *text;
gchar *pattern;
int pattern_length;
};
Match *
Match_new( int i, int k, gchar *text, gchar *pattern )
{
Match *t;
t = g_malloc( sizeof( Match ) );
t->i = i;
t->k = k;
g_assert( text );
t->text = g_strdup( text );
t->pattern = pattern;
t->pattern_length = strlen( pattern );
return t;
}
void
Match_free( gpointer match_, gpointer user_data )
{
Match *match = (Match *) match_;
if ( match ) {
g_free( match->text );
g_free( match );
}
}
void
Match_print( gpointer match_, gpointer user_data )
{
Match *match = (Match *) match_;
printf( "position: %d, errors: %d, text: \"%s\", pattern: \"%s\", pattern_length: %d\n",
match->i, match->k, match->text, match->pattern,
match->pattern_length );
}
/* An array of lists. There are 128 linked lists in total - one for each
* character in our 128-character alphabet. The last 128 lists characters are
* added to the linked lists as needed. Using a power of lets us quickly access
* the array in a cyclic manner by modding using the bitwise & operator.
*
* Each list will contain the offsets for each occurence of that character, or a
* single placeholder offset of -1 if no occurences are found.
*
* The offset is the distance of the character to the left of the end of
* pattern, (i.e., measured by counting to the left from the end of pattern),
* for a given character and a given instance of pattern.
*/
GList *alpha[ALPHABET_SIZE];
/* This function initializes the @alpha and @count arrays to look like this:
*
* alpha = [ [ -1 ] ] * ALPHABET_SIZE where the inner brackets are a GList.
*
* count = [m] * m
*
* @alpha will be an array of linked lists. Characters in the pattern
* that do not occur or that occur exactly once in the text will have
* corresponding linked lists with length one. Characters in the pattern that
* occur in the text more than once will have corresponding linked lists with
* length greater than one.
*
* The first m - 1 elements of @count will be skipped on the first iteration of
* the cyclic array (since no match can be shorter than the @pattern). Note that
* the values in @count are reset to m once they are no longer needed, until the
* next loop around @count.
*
* @p - pattern string
* @m - pattern length
* @alpha - array of GList. See above.
* @count - circular buffer for counts of matches
*/
void
preprocess( char *p, int m, GList *alpha[], int count[], int max_pattern_size )
{
int i, j;
for ( i = 0; i < ALPHABET_SIZE; i++ ) {
alpha[i] = NULL;
alpha[i] = g_list_append( alpha[i], GINT_TO_POINTER( -1 ) );
}
for ( i = 0, j = 128; i < m; i++, p++ ) {
if ( GPOINTER_TO_INT( alpha[*p]->data ) == -1 )
alpha[*p]->data = GINT_TO_POINTER( m - i - 1 );
else
alpha[*p] = g_list_append( alpha[*p],
GINT_TO_POINTER( m - i - 1 ) );
}
for ( i = 0; i < max_pattern_size; i++ )
count[i] = m;
}
void
increment_offset( gpointer off_, gpointer args_ )
{
gpointer *args = (gpointer *) args_;
int i = GPOINTER_TO_INT( args[0] );
int max_pattern_size = GPOINTER_TO_INT( args[1] );
int *count = (int *) args[2];
gint off = GPOINTER_TO_INT( off_ ) ;
count[(i + off) % max_pattern_size]--;
}
gint
match_compare( gconstpointer a_, gconstpointer b_ )
{
Match *a = (Match *) a_;
Match *b = (Match *) b_;
if ( a->k == b->k )
return 0;
else if (a->k < b->k )
return -1;
else return 1;
}
/* Find the position of the first character and number of mismatches of every
* fuzzy match in a string @t with @k or fewer mismatches. Uses the array of
* GList @alpha and the array of counts @count prepared by the preprocess
* function.
* @t - text string
* @n - length of text string
* @m - length of the pattern used to create @alpha and @count
* @k - maximum number of allowed mismatches
* @alpha - array of GList. See above.
* @count - circular buffer for counts of matches
*/
GList *
search( char *t, int n, int m, int k, GList *alpha[], int count[], char *pattern, int max_pattern_size )
{
int i, off, j;
Match *match;
GList *l0 = NULL, *l1 = NULL;
char *text = t;
/* Walk the text @t, which has length @n.
*/
for ( i = 0; i < n; i++ ) {
/* If the current character in @t is in pattern, its
* corresponding list in @alpha will have a non-negative offset,
* thanks to the workdone by the preprocess function. If so, we
* need to decrement the counts in the circular buffer @count
* corresponding to the index of the character in the text and
* the offsets the lists corresponding to those characters,
* which the preprocess function prepared.
*
* Note that we will only ever need m counts at a time, and
* we reset them to @m when we are done with them, in case
* they are needed when the text wraps 256 characters.
*/
l0 = alpha[*t++];
off = GPOINTER_TO_INT( l0->data );
if ( off >= 0 ) {
gpointer t[3] = {
GINT_TO_POINTER( i ),
GINT_TO_POINTER( max_pattern_size ),
(gpointer) count,
};
g_list_foreach( l0, increment_offset, t );
}
/* If the count in @count corresponding to the current index in
* the text is no greater than @k, the number of mismatches we
* allow, then the pattern instance is reported as a fuzzy
* match. The position of the first letter in the match is
* calculated using the pattern length and the index of the last
* character in the match The number of mismatches is calculated
* from the number of matches.
*/
if ( i >= m - 1 && count[i % max_pattern_size] <= k ) {
printf( "i: %d, m: %d\n", i, m );
g_assert( i - m + 1 >= 0 );
g_assert( text );
match = Match_new( i - m + 1, count[i % max_pattern_size], text, pattern );
l1 = g_list_append( l1, match );
}
/* The count in @count corresponding to the current index in
* text is no longer needed, so we reset it to @m until we
* need it on the next wraparound.
*/
count[i % max_pattern_size] = m;
}
/* Sort by increasing k.
*/
l1 = g_list_sort( l1, match_compare );
return l1;
}
/* Generates a list of strings we can use to test this code.
*/
GList *
get_strings()
{
GList *strings = NULL;
strings = g_list_append( strings, "a" );
strings = g_list_append( strings, "ab" );
strings = g_list_append( strings, "abc" );
strings = g_list_append( strings, "abcd" );
strings = g_list_append( strings, "aa" );
strings = g_list_append( strings, "abab" );
strings = g_list_append( strings, "abcabc" );
strings = g_list_append( strings, "abcdabcd" );
strings = g_list_append( strings, "eeeeeee" );
return strings;
}
void
search_append_string( gpointer data, gpointer user_data )
{
GtkWidget *label, *list_box;
gchar *str;
str = (gchar *) data;
label = gtk_label_new( str );
list_box = GTK_WIDGET( user_data );
gtk_box_append( GTK_BOX( list_box ), label );
}
GtkWidget *
create_initial_list_box()
{
GtkWidget *initial_list_box, *label;
gchar* markup;
GList *strings;
initial_list_box = gtk_box_new( GTK_ORIENTATION_VERTICAL, PADDING_ZERO_PIXELS );
label = gtk_label_new( NULL );
markup = "<b>All Results</b>";
gtk_label_set_markup( GTK_LABEL( label ), markup );
gtk_box_append( GTK_BOX( initial_list_box ), label );
strings = get_strings();
g_list_foreach( strings, search_append_string, initial_list_box );
return initial_list_box;
}
void
append_if_match( gpointer data, gpointer user_data )
{
GList *first, *match;
gchar *haystack, *needle;
GList **list_ptr;
int n, m, k;
list_ptr = (GList **) user_data;
haystack = (gchar *) data;
first = g_list_first( *list_ptr );
needle = (gchar *) first->data;
gssize haystack_len = -1;
n = strlen( haystack );
int count[n];
m = strlen( needle );
k = MAX_MISMATCHES;
preprocess( needle, m, alpha, count, n );
match = search( haystack, n, m, k, alpha, count, needle, n );
if ( g_list_length( match ) )
*list_ptr = g_list_append( *list_ptr, match );
}
GList *
find_strings_with_substring( GList *haystacks, gchar *needle )
{
GList *found = NULL;
found = g_list_append( found, needle );
g_list_foreach( haystacks, append_if_match, &found );
found = g_list_remove( found, (gconstpointer) needle );
return found;
}
void
markup_in_string_by_match( gpointer match_, gpointer markup_ )
{
Match *match;
GString *markup;
match = (Match *) match_;
markup = (GString *) markup_;
//g_string_insert( markup, match->i, "<b>" );
//g_string_insert( markup, match->i + match->pattern_length, "</b>" );
}
gchar *
get_markup_from_match( GList *match )
{
const gchar *raw_str;
GString *markup;
gchar *str;
Match *first = (Match *) match->data;
markup = g_string_new( g_strdup( first->text ) );
g_list_foreach( match, markup_in_string_by_match, markup );
return markup->str;
}
void
search_append_match_info( gpointer data, gpointer user_data )
{
const gchar *markup;
GList *match;
GtkWidget *window_box, *label, *list_box;
match = (GList *) data;
window_box = GTK_WIDGET( user_data );
list_box = gtk_widget_get_last_child( window_box );
label = gtk_label_new( NULL );
markup = get_markup_from_match( match );
gtk_label_set_markup( GTK_LABEL( label ), markup );
gtk_box_append( GTK_BOX( list_box ), label );
}
gint
match_list_compare( gconstpointer a_, gconstpointer b_ )
{
GList *a = (GList *) a_;
GList *b = (GList *) b_;
Match *match_a = (Match *) a->data;
Match *match_b = (Match *) b->data;
return match_compare( match_a, match_b );
}
GtkWidget *
create_filtered_list_box( gchar *text, GtkWidget *window_box )
{
GtkWidget *search_bar, *search_bar_box, *list_box, *label;
gchar *markup;
/* Initialize GList pointers to NULL.
*/
GList *strings, *found, *found0, *found1, *s0, *s1, *t;
strings = found = found0 = found1 = s0 = s1 = t = NULL;
search_bar = gtk_widget_get_first_child( window_box );
search_bar_box = gtk_search_bar_get_child( GTK_SEARCH_BAR( search_bar ) );
list_box = gtk_box_new( GTK_ORIENTATION_VERTICAL, PADDING_ZERO_PIXELS );
gtk_box_append( GTK_BOX( window_box ), list_box );
strings = get_strings();
found = find_strings_with_substring( strings, (gchar *) text );
if ( ! g_list_length( found ) )
return list_box;
/* Sort by increasing k.
*/
found = g_list_sort( found, match_list_compare );
/* DBG: Print all matches.
*/
//GList *t = found, *s;
//g_assert( t->data );
//while ( t ) {
// s = (GList *) t->data;
// g_list_foreach( s, Match_print, NULL );
// t = t->next;
// puts("");
//}
/* Get two GLists of GLists - one with the k=0 matches, and another with
* the k>0 matches.
*/
while ( found ) {
t = (GList *) found->data;
s0 = s1 = NULL;
while ( t ) {
Match *match = (Match *) t->data;
if ( match ) {
if ( match->k )
s1 = g_list_append( s1, match );
else
s0 = g_list_append( s0, match );
t = t->next;
}
}
if ( s0 )
found0 = g_list_append( found0, s0 );
if ( s1 )
found1 = g_list_append( found1, s1 );
found = found->next;
}
/* Add the exact (k=0) matches, if any.
*/
if ( g_list_length( found0 ) ) {
label = gtk_label_new( NULL );
markup = "<b>Exact Matches</b>";
gtk_label_set_markup( GTK_LABEL( label ), markup );
gtk_box_append( GTK_BOX( list_box ), label );
g_list_foreach( found0, search_append_match_info, window_box );
}
/* Add the fuzzy (k>0) matches, if any.
*/
if ( g_list_length( found1 ) ) {
label = gtk_label_new( NULL );
markup = "<b>Fuzzy Matches</b>";
/* Add vertical space after exact matches, if any.
*/
if ( g_list_length( found0 ) )
gtk_widget_set_margin_top( label, 10 );
gtk_label_set_markup( GTK_LABEL( label ), markup );
gtk_box_append( GTK_BOX( list_box ), label );
g_list_foreach( found1, search_append_match_info, window_box );
}
return list_box;
}
void
reset_list_box( GtkWidget *window_box )
{
GtkWidget *search_bar, *search_bar_box, *list_box;
GList *strings;
search_bar = gtk_widget_get_first_child( window_box );
search_bar_box = gtk_search_bar_get_child( GTK_SEARCH_BAR( search_bar ) );
list_box = create_initial_list_box();
gtk_box_append( GTK_BOX( window_box ), list_box );
}
void
search_changed( GtkWidget *search, gpointer user_data )
{
GtkWidget *search_bar, *search_box, *search_entry, *window_box,
*list_box;
GList *strings, *found;
const gchar *text;
window_box = GTK_WIDGET( user_data );
search_bar = gtk_widget_get_first_child( window_box );
search_box = gtk_search_bar_get_child( GTK_SEARCH_BAR( search_bar ) );
search_entry = gtk_widget_get_first_child( search_box );
list_box = gtk_widget_get_last_child( window_box );
gtk_box_remove( GTK_BOX( window_box ), list_box );
text = gtk_editable_get_text( GTK_EDITABLE( search_entry) );
if ( !text || !strlen( text ) )
reset_list_box( window_box );
else
create_filtered_list_box( (gchar *) text, window_box );
}
void
activate( GtkApplication *app, gpointer user_data )
{
GtkWidget *window, *window_box, *search_bar, *search_bar_box,
*search_entry, *initial_list_box;
window = gtk_application_window_new( app );
window_box = gtk_box_new( GTK_ORIENTATION_VERTICAL, PADDING_ZERO_PIXELS );
gtk_window_set_child( GTK_WINDOW( window ), window_box );
search_bar = gtk_search_bar_new();
gtk_search_bar_set_search_mode( GTK_SEARCH_BAR( search_bar ), TRUE );
search_bar_box = gtk_box_new( GTK_ORIENTATION_VERTICAL, PADDING_ZERO_PIXELS );
gtk_search_bar_set_child( GTK_SEARCH_BAR( search_bar ), search_bar_box );
search_entry = gtk_search_entry_new();
gtk_box_append( GTK_BOX( search_bar_box ), search_entry );
gtk_search_bar_connect_entry( GTK_SEARCH_BAR( search_bar ),
GTK_EDITABLE( search_entry ) );
g_signal_connect( search_entry, "search-changed",\
G_CALLBACK( search_changed ), window_box );
gtk_box_append( GTK_BOX( window_box ), search_bar );
initial_list_box = create_initial_list_box();
gtk_box_append( GTK_BOX( window_box ), initial_list_box );
gtk_widget_show( window );
}
int
main( int argc, char** argv )
{
GtkApplication *app;
int status;
app = gtk_application_new("org.gtk.example", G_APPLICATION_FLAGS_NONE );
g_signal_connect( app, "activate", G_CALLBACK( activate ), NULL );
status = g_application_run( G_APPLICATION( app ), argc, argv );
g_object_unref( app );
return status;
}
@angstyloop
Copy link
Author

angstyloop commented Jun 12, 2023

gtk4_search_with_mismatches_demo

@angstyloop
Copy link
Author

angstyloop commented Jun 12, 2023

Notes

  • A maximum number of allowed mismatches of 3 was selected. You must decide on your own limit for your application. I don't think that allowing mismatches up to the length of the pattern string is very helpful. I think strings with only a few mismatches are more appropriate for munging. If there are too many allowed mismatches, as the user I start to wonder "why am I being shown this"?

  • Exact matches are bold in the result text, but I don't think we want to highlight inexact matches. It would look weird when the user types in "bbb" and sees "png" highlighted in the tag name. That's not helpful to see, unlike a highlighted exact match.

Recent Improvements

  • Results are sorted by number of mismatches k, but the matches with k=0 should be set apart even more somehow. A clickable "view more" button or label that always shows at the end of the list seems like a good approach to me. If we only show it when there are no exact matches, we lose a little bit of the functionality. Discreetly giving the option to always toggle fuzzy matches on/off seems like a good way to go.
  • Results are sorted by number of mismatches k. Results with k=0 (exact matches) are set apart from results with k>0 (fuzzy matches) using headings.
  • Exact matches are bold in the result text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment