dmacjam/discourse-gsoc17-submission.md

## discourse-gsoc17-submission.md

      
    Raw
  

              discourse-gsoc17-submission.md
            
          
    Google Summer of Code 2017 at Discourse


Name: Jakub Macina
Nickname: @dmacjam
Email: jakub.macina@gmail.com
Personal Website: https://dmacjam.github.io

Advanced search enhancements & Autospec gem


Project description: https://summerofcode.withgoogle.com/projects/#5732151531143168
Project proposal: https://gist.github.com/dmacjam/887e50a4d1b349f6b7e44f6302314d62
List of GSOC contributions: https://github.com/discourse/discourse/pulls?utf8=%E2%9C%93&q=is%3Apr%20author%3Admacjam%20created%3A2017-05-20..2017-08-20%20

Code

1. Search posts which contains all tags

Description: New feature is added as a checkbox Contains all tags to advanced search. If checkbox is checked, separator between tags is '+' (represents logical AND). Example search: tag:pr-welcome+foobar. Backend for search is done in PostgreSQL built-in full text search and it has same performance for arbitrary number of tags.
Code: discourse/discourse#4896
Status: Merged

2. Search posts with images

Description: Searching for posts with images in advanced search by selecting filter Includes image(s) in section Only return topics/posts that… or typing in:image to search query. It is working for uploaded images, images pasted as links and oneboxed images.
Code: discourse/discourse#4917
Status: Merged

3. Search posts by filetype

Description: Searching for posts which contain files with explicit file extensions by querying <keywords> filetype:pdf. Added extraction of file extensions and persisting it to database.
Code: discourse/discourse#4958
Status: Merged

4. Extend number of search results by infinite scrolling

Description: Added infinite loading of search results, in total 10 pages of 50 results are loaded when scrolling which results in 500 search results (not more because of the performance, even Google limits maximum 1000 results per query). If there are still more results, it tells users to narrow down their search criteria.
Code: discourse/discourse#4981
Status: Merged
Note: I changed my plan and focused on this task instead of autospec gem as it was more important for Discourse.

Reports

Week #1

What I've been working on


Added new checkbox Contains all tags to advanced search frontend.
Added new separator '+' if the checkbox is checked.
Created PostgreSQL built-in full text search query.
Added rspec tests and move other test related to tags into one context.
Tried to add qunit acceptance tests and found a similar code which is commented and the commit is saying:  Add acceptance tests for the rest of the advanced search UI (except for Tags, tags are evil)
(after spending a day on it I understand why this commit message  :D ).

What I discussed with my mentor


Which separator to use between tags - there was problem with ampersand sign and thanks @neil for a recommendation of the plus sign.
whether it is good to refactor tests

At the end of the first week, I manged to created a pull request (in my application I had two weeks planned for this task but everything went smootly): discourse/discourse#4896
What I'll be working on next

Task is done.
Week #2

What I've been working on


Add new filter Include(s) images to advanced search
Add extraction of image_url attribute of posts for oneboxed images (e.g. Imgur).
Add rspec test for searching posts with images.
Add backend code for searching posts with images.

I decided to split the task into two parts:

searching by images - it is very common use case
searching by any filetype, e.g. filetype: pdf

Here is the pull request for searching by images:
discourse/discourse#4917
What I discussed with my mentor


How to design searching by any filetype - we agreed on adding new column filetype to TopicLink.
How to extract file extensions.

What I'll be working on next


Fix the syntax of the filter as suggested in pull request review.
Add extraction of the links filetypes in posts.

Week #3

What I've been working on


Fix filter name for searching posts by images to with:images.
Fix order of processing in CookedPostProcessor.
Found an issue with extraction of links from posts, which is not called when downloading the remote images is disabled.

What I discussed with my mentor


Pull requests merging.

What I'll be working on next


Add input field for searching posts by filetype to advanced search UI.
Add rspec tests for filetype extraction from link in post.
Add extraction of a link filetype in a post.
Add backend code for searching by any filetype: filetype:pdf.

Week #4

What I've been working on


Add file extension column to TopicLinks.
Add extraction of link extension.
Add rspec tests for searching by filetypes, e.g. filetype:pdf.
Add backend code for searching by filetypes.
Testing with different site settings - it is not yet fully working with images which are not lightboxed.

What I discussed with my mentor


Upgrading Discourse versions - rebaking of posts.
How to solve issue with images that are not inside lightbox.

What I'll be working on next


Fix problem with images which are not lightboxed.
Add input field for filetypes to advanced search UI.
Do a research on limited search results and propose solutions.

Week #5

What I've been working on


Fix issue with images which are not lightboxed by adding column with file extension to Uploads.
Change existing methods to use upload extension attribute (before it was dynamically computed).
Add onceoff job for extracting extension of existing uploads.
Update backend code for searching as a combination of links and uploads extensions.
Add input field for filetypes to advanced search UI - it's not so intuitive (see screencast below) and we decided to not include it (searching posts with filetypes will be power user feature - it is necessary to type search as: keyword filetype:xxx).

https://i.imgur.com/lSvDupn.gif
Finally, pull request is available here:
discourse/discourse#4958
What I discussed with my mentor


Using database limit on columns length (whether to raise and error or cut the length it in the code when it exceeds limit).
User-friendliness of input fields.
The idea of onceoff jobs.

What I'll be working on next


Start working on a limited search results issue.

Week #6

What I've been working on


Fix onceoff job for extracting extension of existing uploads with batch loading.
Simplifying code in search posts by filetype pull request.
Add pull request fixing bug in populate command.
I tried to understand the backend code for searching, i.e. how posts are ranked, limited and aggregated. I analyzed the possibilities of fixing the limited search results and I focused on infinite loading. Here is my analysis with two alternatives which can be used on the database level:

Offset - Better for paging with 50 results and setting some reasonable maximum number of results (in this case 200 or 500).

Easier to implement – only need to pass page number and set offset in DB query.
Stable time performance for each page – top-n heapsort which compares rank of every post which contain keywords.
Increasing memory performance with each additional page - it will read the rows already shown on the previous pages and discard them before finally reaching the results for the next page.
If new post which satisfy search criteria is added between search, the new page will contain duplicate. (http://use-the-index-luke.com/no-offset)

Seek method - Better performance which is more suitable if API with all results is needed.

More complicated code – need to store and pass 4 values, which are now used to order search results:

Cover density rank between query and topic title
Cover density rank between query and search data
Last update of topic/post
Topic/Post id


Slightly better performance for each additional page - results from previous pages are filtered out in where clause.
Stable memory usage – equal to number of results returned.

Further comparison of these approaches is available here: https://www.slideshare.net/MarkusWinand/p2d2-pagination-done-the-postgresql-way
What are your requirements? What do you think is better to use?
In my opinion, I would focus on offset (and removing aggregation of posts within one topic) because it is easier to implement and I don't expect users would go through more than 500 results (for example Google limits maximum 1000 results per query). And if there is more results, then tell user to narrow down the search criteria.
What I discussed with my mentor


Performance testing in development mode - available backups and population of database with posts.

What I'll be working on next


Start implementing fix for limited search results.

Week #7

What I've been working on

I've created PR fixing limited search results by informing user if there are more search results available. Moreover, I added support for backend search pagination (using offset) which will be needed for infinite loading of more search results.
discourse/discourse#4981
What I discussed with my mentor


Offset vs. seek method mentioned in my previous report.
How to reuse existing infinite loading code for full page search results.

What I'll be working on next

Infinite loading of more search results. My idea is to add "infinite" loading of 10 more additional pages, in total of maximum 500 search results (because of offset performance). It would require a lot of changes at frontend. Shall I focus on that? Or removing aggregation of posts within one topic in search results? There also might be a trade-off to increase the limit of search results to 100 without a pagination.
Week #8

What I’ve been working on


Improved rspec tests for search pagination.
Fixed documentation about how to run sidekiq.
Added infinite loading of search results to advanced search frontend - in total 10 pages of 50 results are loaded when scrolling which results in 500 search results. If there are still more results, it tells users to modify their search query.

What I discussed with my mentor


Whether it’s a good idea to split big pull request into smaller.
Weird git conflicts in pull requests.

What I’ll be working on next

The task is finished - pull request is ready for review.
https://meta.discourse.org/t/search-results-limited-to-50-each-time/34270/12