@LeeMeng2020
Created October 4, 2020 13:35
This one was interesting; I wanted to figure out a way to limit the Load More. The sitemap below will stop at 200 results. More details in the attached text file.
{
"_id": "cbc-load-more",
"startUrl": ["https://www.cbc.ca/search?q=quebec%20tourism&section=all&sortOrder=relevance&media=all"],
"selectors": [{
"id": "Separate Load More",
"type": "SelectorElementClick",
"parentSelectors": ["_root"],
"selector": " div.contentListCards",
"multiple": false,
"delay": "3700",
"clickElementSelector": "div > button[class^='sclt-loadmore']:not([class*='loadmore20'])",
"clickType": "clickMore",
"discardInitialElements": "do-not-discard",
"clickElementUniquenessType": "uniqueHTML"
}, {
"id": "Row wrappers",
"type": "SelectorElement",
"parentSelectors": ["_root"],
"selector": "div.contentListCards a.card",
"multiple": true,
"delay": 0
}, {
"id": "Title",
"type": "SelectorText",
"parentSelectors": ["Row wrappers"],
"selector": "h3",
"multiple": false,
"regex": "",
"delay": 0
}, {
"id": "Time",
"type": "SelectorText",
"parentSelectors": ["Row wrappers"],
"selector": "time",
"multiple": false,
"regex": "",
"delay": 0
}, {
"id": "Link",
"type": "SelectorLink",
"parentSelectors": ["Row wrappers"],
"selector": "_parent_",
"multiple": false,
"delay": 0
}]
}
Answer for forum question at:
https://forum.webscraper.io/t/page-with-load-more-pagination-abruptly-closes-during-scraping/6252
This one was interesting; I wanted to figure out a way to limit the Load More. Try this sitemap, which will stop at 200 results. I recommend a Page load delay of at least 5000 ms.
This sitemap clicks through all the Load More buttons first, so it might look like nothing much is happening for a while. The results are actually loading below the visible part of the page; the "Showing results 1 – XXX of" counter, which changes every few seconds, shows the progress.
If you want more or fewer results, you'll need to do some math to figure out which Load More to stop at, and then change the number in the Load More selector:
div > button[class^='sclt-loadmore']:not([class*='loadmore20'])
Each Load More click loads an additional 10 results. In this example, clicking stops at loadmore20, so 20 x 10 = 200 results.
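The math can be sketched in a few lines of Python. This is only an illustration, not part of the sitemap: the helper name load_more_selector is made up, and it assumes what the note describes, namely that every Load More adds 10 results and that the number in the button's class goes up by one with each click.

```python
import math

# Assumption taken from the note above: each Load More adds 10 results.
RESULTS_PER_CLICK = 10


def load_more_selector(target_results: int, per_click: int = RESULTS_PER_CLICK) -> str:
    """Return a clickElementSelector that stops near target_results (hypothetical helper)."""
    if target_results < per_click:
        raise ValueError("target_results must be at least one batch of results")
    # Round up so a target such as 45 stops at loadmore5 (about 50 results).
    stop_at = math.ceil(target_results / per_click)
    return (
        "div > button[class^='sclt-loadmore']"
        f":not([class*='loadmore{stop_at}'])"
    )


print(load_more_selector(200))
```

For 200 results this prints the same selector used in the sitemap. Paste the output into the "clickElementSelector" field of the "Separate Load More" selector to try a different limit.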