- Name: Jakub Macina
- Nickname: @dmacjam
- Email: jakub.macina@gmail.com
- Personal Website: https://dmacjam.github.io
- Project description: https://summerofcode.withgoogle.com/projects/#5732151531143168
- Project proposal: https://gist.github.com/dmacjam/887e50a4d1b349f6b7e44f6302314d62
- List of GSOC contributions: https://github.com/discourse/discourse/pulls?utf8=%E2%9C%93&q=is%3Apr%20author%3Admacjam%20created%3A2017-05-20..2017-08-20%20
1. Search posts which contains all tags
- Description: New feature is added as a checkbox
Contains all tags
to advanced search. If checkbox is checked, separator between tags is '+' (represents logical AND). Example search:tag:pr-welcome+foobar
. Backend for search is done in PostgreSQL built-in full text search and it has same performance for arbitrary number of tags. - Code: discourse/discourse#4896
- Status: Merged
2. Search posts with images
- Description: Searching for posts with images in advanced search by selecting filter
Includes image(s)
in section Only return topics/posts that… or typingin:image
to search query. It is working for uploaded images, images pasted as links and oneboxed images. - Code: discourse/discourse#4917
- Status: Merged
3. Search posts by filetype
- Description: Searching for posts which contain files with explicit file extensions by querying
<keywords> filetype:pdf
. Added extraction of file extensions and persisting it to database. - Code: discourse/discourse#4958
- Status: Merged
4. Extend number of search results by infinite scrolling
- Description: Added infinite loading of search results, in total 10 pages of 50 results are loaded when scrolling which results in 500 search results (not more because of the performance, even Google limits maximum 1000 results per query). If there are still more results, it tells users to narrow down their search criteria.
- Code: discourse/discourse#4981
- Status: Merged
- Note: I changed my plan and focused on this task instead of autospec gem as it was more important for Discourse.
- Added new checkbox
Contains all tags
to advanced search frontend. - Added new separator '+' if the checkbox is checked.
- Created PostgreSQL built-in full text search query.
- Added rspec tests and move other test related to tags into one context.
- Tried to add qunit acceptance tests and found a similar code which is commented and the commit is saying: Add acceptance tests for the rest of the advanced search UI (except for Tags, tags are evil) (after spending a day on it I understand why this commit message :D ).
- Which separator to use between tags - there was problem with ampersand sign and thanks @neil for a recommendation of the plus sign.
- whether it is good to refactor tests
At the end of the first week, I manged to created a pull request (in my application I had two weeks planned for this task but everything went smootly): discourse/discourse#4896
Task is done.
- Add new filter Include(s) images to advanced search
- Add extraction of
image_url
attribute of posts for oneboxed images (e.g. Imgur). - Add rspec test for searching posts with images.
- Add backend code for searching posts with images.
I decided to split the task into two parts:
- searching by images - it is very common use case
- searching by any filetype, e.g.
filetype: pdf
Here is the pull request for searching by images: discourse/discourse#4917
- How to design searching by any filetype - we agreed on adding new column
filetype
to TopicLink. - How to extract file extensions.
- Fix the syntax of the filter as suggested in pull request review.
- Add extraction of the links filetypes in posts.
- Fix filter name for searching posts by images to
with:images
. - Fix order of processing in
CookedPostProcessor
. - Found an issue with extraction of links from posts, which is not called when downloading the remote images is disabled.
- Pull requests merging.
- Add input field for searching posts by filetype to advanced search UI.
- Add rspec tests for filetype extraction from link in post.
- Add extraction of a link filetype in a post.
- Add backend code for searching by any filetype:
filetype:pdf
.
- Add file extension column to
TopicLinks
. - Add extraction of link extension.
- Add rspec tests for searching by filetypes, e.g.
filetype:pdf
. - Add backend code for searching by filetypes.
- Testing with different site settings - it is not yet fully working with images which are not lightboxed.
- Upgrading Discourse versions - rebaking of posts.
- How to solve issue with images that are not inside lightbox.
- Fix problem with images which are not lightboxed.
- Add input field for filetypes to advanced search UI.
- Do a research on limited search results and propose solutions.
- Fix issue with images which are not lightboxed by adding column with file extension to
Uploads
. - Change existing methods to use upload extension attribute (before it was dynamically computed).
- Add onceoff job for extracting extension of existing uploads.
- Update backend code for searching as a combination of links and uploads extensions.
- Add input field for filetypes to advanced search UI - it's not so intuitive (see screencast below) and we decided to not include it (searching posts with filetypes will be power user feature - it is necessary to type search as:
keyword filetype:xxx
).
https://i.imgur.com/lSvDupn.gif
Finally, pull request is available here: discourse/discourse#4958
- Using database limit on columns length (whether to raise and error or cut the length it in the code when it exceeds limit).
- User-friendliness of input fields.
- The idea of onceoff jobs.
- Start working on a limited search results issue.
- Fix onceoff job for extracting extension of existing uploads with batch loading.
- Simplifying code in search posts by filetype pull request.
- Add pull request fixing bug in populate command.
- I tried to understand the backend code for searching, i.e. how posts are ranked, limited and aggregated. I analyzed the possibilities of fixing the limited search results and I focused on infinite loading. Here is my analysis with two alternatives which can be used on the database level:
Offset - Better for paging with 50 results and setting some reasonable maximum number of results (in this case 200 or 500).
- Easier to implement – only need to pass page number and set offset in DB query.
- Stable time performance for each page – top-n heapsort which compares rank of every post which contain keywords.
- Increasing memory performance with each additional page - it will read the rows already shown on the previous pages and discard them before finally reaching the results for the next page.
- If new post which satisfy search criteria is added between search, the new page will contain duplicate. (http://use-the-index-luke.com/no-offset)
Seek method - Better performance which is more suitable if API with all results is needed.
- More complicated code – need to store and pass 4 values, which are now used to order search results:
- Cover density rank between query and topic title
- Cover density rank between query and search data
- Last update of topic/post
- Topic/Post id
- Slightly better performance for each additional page - results from previous pages are filtered out in where clause.
- Stable memory usage – equal to number of results returned.
Further comparison of these approaches is available here: https://www.slideshare.net/MarkusWinand/p2d2-pagination-done-the-postgresql-way
What are your requirements? What do you think is better to use? In my opinion, I would focus on offset (and removing aggregation of posts within one topic) because it is easier to implement and I don't expect users would go through more than 500 results (for example Google limits maximum 1000 results per query). And if there is more results, then tell user to narrow down the search criteria.
- Performance testing in development mode - available backups and population of database with posts.
- Start implementing fix for limited search results.
I've created PR fixing limited search results by informing user if there are more search results available. Moreover, I added support for backend search pagination (using offset) which will be needed for infinite loading of more search results.
- Offset vs. seek method mentioned in my previous report.
- How to reuse existing infinite loading code for full page search results.
Infinite loading of more search results. My idea is to add "infinite" loading of 10 more additional pages, in total of maximum 500 search results (because of offset performance). It would require a lot of changes at frontend. Shall I focus on that? Or removing aggregation of posts within one topic in search results? There also might be a trade-off to increase the limit of search results to 100 without a pagination.
- Improved rspec tests for search pagination.
- Fixed documentation about how to run sidekiq.
- Added infinite loading of search results to advanced search frontend - in total 10 pages of 50 results are loaded when scrolling which results in 500 search results. If there are still more results, it tells users to modify their search query.
- Whether it’s a good idea to split big pull request into smaller.
- Weird git conflicts in pull requests.
The task is finished - pull request is ready for review.
https://meta.discourse.org/t/search-results-limited-to-50-each-time/34270/12