Skip to content

Instantly share code, notes, and snippets.

@dmacjam
Last active August 23, 2017 20:58
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save dmacjam/66211e9149ad72537bc12d0e483ea9e1 to your computer and use it in GitHub Desktop.
Save dmacjam/66211e9149ad72537bc12d0e483ea9e1 to your computer and use it in GitHub Desktop.
Jakub Macina @dmacjam Discourse GSoC17 Submission Report

Google Summer of Code 2017 at Discourse

Advanced search enhancements & Autospec gem

Code

1. Search posts which contains all tags

  • Description: New feature is added as a checkbox Contains all tags to advanced search. If checkbox is checked, separator between tags is '+' (represents logical AND). Example search: tag:pr-welcome+foobar. Backend for search is done in PostgreSQL built-in full text search and it has same performance for arbitrary number of tags.
  • Code: discourse/discourse#4896
  • Status: Merged

2. Search posts with images

  • Description: Searching for posts with images in advanced search by selecting filter Includes image(s) in section Only return topics/posts that… or typing in:image to search query. It is working for uploaded images, images pasted as links and oneboxed images.
  • Code: discourse/discourse#4917
  • Status: Merged

3. Search posts by filetype

  • Description: Searching for posts which contain files with explicit file extensions by querying <keywords> filetype:pdf. Added extraction of file extensions and persisting it to database.
  • Code: discourse/discourse#4958
  • Status: Merged

4. Extend number of search results by infinite scrolling

  • Description: Added infinite loading of search results, in total 10 pages of 50 results are loaded when scrolling which results in 500 search results (not more because of the performance, even Google limits maximum 1000 results per query). If there are still more results, it tells users to narrow down their search criteria.
  • Code: discourse/discourse#4981
  • Status: Merged
  • Note: I changed my plan and focused on this task instead of autospec gem as it was more important for Discourse.

Reports

Week #1

What I've been working on

  • Added new checkbox Contains all tags to advanced search frontend.
  • Added new separator '+' if the checkbox is checked.
  • Created PostgreSQL built-in full text search query.
  • Added rspec tests and move other test related to tags into one context.
  • Tried to add qunit acceptance tests and found a similar code which is commented and the commit is saying: Add acceptance tests for the rest of the advanced search UI (except for Tags, tags are evil) (after spending a day on it I understand why this commit message :D ).

What I discussed with my mentor

  • Which separator to use between tags - there was problem with ampersand sign and thanks @neil for a recommendation of the plus sign.
  • whether it is good to refactor tests

At the end of the first week, I manged to created a pull request (in my application I had two weeks planned for this task but everything went smootly): discourse/discourse#4896

What I'll be working on next

Task is done.

Week #2

What I've been working on

  • Add new filter Include(s) images to advanced search
  • Add extraction of image_url attribute of posts for oneboxed images (e.g. Imgur).
  • Add rspec test for searching posts with images.
  • Add backend code for searching posts with images.

I decided to split the task into two parts:

  1. searching by images - it is very common use case
  2. searching by any filetype, e.g. filetype: pdf

Here is the pull request for searching by images: discourse/discourse#4917

What I discussed with my mentor

  • How to design searching by any filetype - we agreed on adding new column filetype to TopicLink.
  • How to extract file extensions.

What I'll be working on next

  • Fix the syntax of the filter as suggested in pull request review.
  • Add extraction of the links filetypes in posts.

Week #3

What I've been working on

  • Fix filter name for searching posts by images to with:images.
  • Fix order of processing in CookedPostProcessor.
  • Found an issue with extraction of links from posts, which is not called when downloading the remote images is disabled.

What I discussed with my mentor

  • Pull requests merging.

What I'll be working on next

  • Add input field for searching posts by filetype to advanced search UI.
  • Add rspec tests for filetype extraction from link in post.
  • Add extraction of a link filetype in a post.
  • Add backend code for searching by any filetype: filetype:pdf.

Week #4

What I've been working on

  • Add file extension column to TopicLinks.
  • Add extraction of link extension.
  • Add rspec tests for searching by filetypes, e.g. filetype:pdf.
  • Add backend code for searching by filetypes.
  • Testing with different site settings - it is not yet fully working with images which are not lightboxed.

What I discussed with my mentor

  • Upgrading Discourse versions - rebaking of posts.
  • How to solve issue with images that are not inside lightbox.

What I'll be working on next

  • Fix problem with images which are not lightboxed.
  • Add input field for filetypes to advanced search UI.
  • Do a research on limited search results and propose solutions.

Week #5

What I've been working on

  • Fix issue with images which are not lightboxed by adding column with file extension to Uploads.
  • Change existing methods to use upload extension attribute (before it was dynamically computed).
  • Add onceoff job for extracting extension of existing uploads.
  • Update backend code for searching as a combination of links and uploads extensions.
  • Add input field for filetypes to advanced search UI - it's not so intuitive (see screencast below) and we decided to not include it (searching posts with filetypes will be power user feature - it is necessary to type search as: keyword filetype:xxx).

https://i.imgur.com/lSvDupn.gif

Finally, pull request is available here: discourse/discourse#4958

What I discussed with my mentor

  • Using database limit on columns length (whether to raise and error or cut the length it in the code when it exceeds limit).
  • User-friendliness of input fields.
  • The idea of onceoff jobs.

What I'll be working on next

  • Start working on a limited search results issue.

Week #6

What I've been working on

  • Fix onceoff job for extracting extension of existing uploads with batch loading.
  • Simplifying code in search posts by filetype pull request.
  • Add pull request fixing bug in populate command.
  • I tried to understand the backend code for searching, i.e. how posts are ranked, limited and aggregated. I analyzed the possibilities of fixing the limited search results and I focused on infinite loading. Here is my analysis with two alternatives which can be used on the database level:

Offset - Better for paging with 50 results and setting some reasonable maximum number of results (in this case 200 or 500).

  • Easier to implement – only need to pass page number and set offset in DB query.
  • Stable time performance for each page – top-n heapsort which compares rank of every post which contain keywords.
  • Increasing memory performance with each additional page - it will read the rows already shown on the previous pages and discard them before finally reaching the results for the next page.
  • If new post which satisfy search criteria is added between search, the new page will contain duplicate. (http://use-the-index-luke.com/no-offset)

Seek method - Better performance which is more suitable if API with all results is needed.

  • More complicated code – need to store and pass 4 values, which are now used to order search results:
    • Cover density rank between query and topic title
    • Cover density rank between query and search data
    • Last update of topic/post
    • Topic/Post id
  • Slightly better performance for each additional page - results from previous pages are filtered out in where clause.
  • Stable memory usage – equal to number of results returned.

Further comparison of these approaches is available here: https://www.slideshare.net/MarkusWinand/p2d2-pagination-done-the-postgresql-way

What are your requirements? What do you think is better to use? In my opinion, I would focus on offset (and removing aggregation of posts within one topic) because it is easier to implement and I don't expect users would go through more than 500 results (for example Google limits maximum 1000 results per query). And if there is more results, then tell user to narrow down the search criteria.

What I discussed with my mentor

  • Performance testing in development mode - available backups and population of database with posts.

What I'll be working on next

  • Start implementing fix for limited search results.

Week #7

What I've been working on

I've created PR fixing limited search results by informing user if there are more search results available. Moreover, I added support for backend search pagination (using offset) which will be needed for infinite loading of more search results.

discourse/discourse#4981

What I discussed with my mentor

  • Offset vs. seek method mentioned in my previous report.
  • How to reuse existing infinite loading code for full page search results.

What I'll be working on next

Infinite loading of more search results. My idea is to add "infinite" loading of 10 more additional pages, in total of maximum 500 search results (because of offset performance). It would require a lot of changes at frontend. Shall I focus on that? Or removing aggregation of posts within one topic in search results? There also might be a trade-off to increase the limit of search results to 100 without a pagination.

Week #8

What I’ve been working on

  • Improved rspec tests for search pagination.
  • Fixed documentation about how to run sidekiq.
  • Added infinite loading of search results to advanced search frontend - in total 10 pages of 50 results are loaded when scrolling which results in 500 search results. If there are still more results, it tells users to modify their search query.

What I discussed with my mentor

  • Whether it’s a good idea to split big pull request into smaller.
  • Weird git conflicts in pull requests.

What I’ll be working on next

The task is finished - pull request is ready for review.

https://meta.discourse.org/t/search-results-limited-to-50-each-time/34270/12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment