
How to prevent Next.js websites from appearing in search engines (SEO, Google crawler)?

To block a site from search engines and ensure it never appears in search results, you can adopt several techniques. Here are the main ones:

1. Disallow crawling and indexing (robots.txt, meta tags, and headers)

1.1. Use "noindex" meta tags


This instructs search engines not to index the specific page. Add the following meta tag in the HTML head of each page you want to block.

<meta name="robots" content="noindex,nofollow" />

For Single Page Applications (SPAs)

If your project is a SPA, you might need to dynamically set meta tags or serve a pre-rendered version of your pages with the noindex meta tag to search engines. Tools and frameworks like Angular Universal, React Helmet, or Next.js can help manage SEO-related meta tags dynamically.
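For example, with the Next.js Pages Router you can render the tag through next/head. This is a minimal sketch; the page name is hypothetical:

// pages/private-page.js (hypothetical page)
import Head from 'next/head';

export default function PrivatePage() {
  return (
    <>
      <Head>
        {/* Tell crawlers not to index this page or follow its links */}
        <meta name="robots" content="noindex,nofollow" />
      </Head>
      <main>This page should not appear in search results.</main>
    </>
  );
}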

Note:

The robots.txt method tells search engines not to crawl the pages, but it doesn't guarantee that the pages won't appear in search results if they are linked to other sites. The meta tag approach is more effective in ensuring pages are not indexed.

These methods rely on the cooperation of search engine bots and respect for the directives. Most reputable search engines will follow these instructions, but it's possible for some bots to ignore them.

1.2 Use the robots.txt file


The robots.txt file can instruct search engines not to crawl your site or parts of it. Thanks to Next.js static file serving, adding a robots.txt file is easy: create a new file named robots.txt in the public folder at the root of your project, and it will be served from the root of your site. Use content like one of the examples below, depending on what you want to block:

# public/robots.txt

# Block all crawlers for /accounts
User-agent: *
Disallow: /accounts

# Block all files
User-agent: *
Disallow: /

# Allow all crawlers
User-agent: *
Allow: /

When you run your app with npm run dev, it will now be available at http://localhost:3000/robots.txt. Note that the public folder name is not part of the URL. Do not name the public directory anything else. The name cannot be changed and is the only directory used to serve static assets.

1.3 Use HTTP X-Robots-Tag headers


Configure your web server to add an HTTP header that instructs search engines not to index the content. For example, on an Apache server, you can add the following to the .htaccess file:
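A minimal sketch, assuming the Apache mod_headers module is enabled:

<IfModule mod_headers.c>
  Header set X-Robots-Tag "noindex, nofollow"
</IfModule>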

In a Next.js application, you can dynamically add the X-Robots-Tag header in your API routes or in middleware. For example, to prevent a page from being indexed, you can set the header on the response in getServerSideProps (getStaticProps runs at build time and has no response object to modify):

export async function getServerSideProps(context) {
  context.res.setHeader('X-Robots-Tag', 'noindex, nofollow');
  return { props: {} };
}

This approach allows for dynamic control over indexing, making it a versatile choice for applications built with Next.js. It's useful for pages that should not appear in search results, such as user dashboard pages or unpublished content.
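The middleware approach mentioned above can look like this (Next.js 12+); the matcher path is only an example:

// middleware.js (at the project root)
import { NextResponse } from 'next/server';

export function middleware() {
  // Add the header to every response matched by the config below
  const response = NextResponse.next();
  response.headers.set('X-Robots-Tag', 'noindex, nofollow');
  return response;
}

export const config = {
  matcher: '/drafts/:path*', // hypothetical section that should stay out of search results
};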

1.4 Prevent indexing a domain on AWS using ELB Listener


If you are using ELB, you can add a listener rule to return a fixed response for your robots.txt.

This means you do not need code within your site to distinguish between staging and production, and it allows you to keep your production version clean.

I tested this on our public load balancers and it works, but you need to attach the rule to the correct HTTP listener port.

1.5 Adding X-Robots-Tag in Next.js

In Next.js, custom server headers, including the X-Robots-Tag, can be configured in the next.config.js file using the headers method. Here's a step-by-step guide to adding a noindex X-Robots-Tag:

Open or create a next.config.js file in the root of your Next.js project.

Add the headers async function to export custom headers.

Specify the paths and corresponding headers you wish to add. For a global noindex, you can use a catch-all source such as /:path*.

module.exports = {
  async headers() {
    return [
      {
        source: '/admin/:path*', // Example for an admin section
        headers: [
          {
            key: 'X-Robots-Tag',
            value: 'noindex, nofollow'
          }
        ]
      }
    ]
  }
}

This configuration will instruct search engines not to index or follow any links from pages under the /admin path. Adjust the source value according to the specific pages or sections of your site that you want to apply these rules to.
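For a site-wide noindex (for example, an entire staging environment), a catch-all source can be used instead. A minimal sketch:

// next.config.js
module.exports = {
  async headers() {
    return [
      {
        source: '/:path*', // matches every route, including the root
        headers: [
          {
            key: 'X-Robots-Tag',
            value: 'noindex, nofollow'
          }
        ]
      }
    ]
  }
}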

2. Protect the web application with a password

Set up basic HTTP authentication to protect the site with a password. Search engines cannot index pages that require authentication to access.
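In Next.js, this can be sketched with middleware that performs HTTP Basic authentication. The credentials below are placeholders; in a real setup they should come from environment variables:

// middleware.js
import { NextResponse } from 'next/server';

export function middleware(request) {
  const authHeader = request.headers.get('authorization');

  if (authHeader) {
    // Header format: "Basic base64(user:password)"
    const encoded = authHeader.split(' ')[1] ?? '';
    const [user, password] = atob(encoded).split(':');

    // Placeholder credentials; read them from process.env in practice
    if (user === 'admin' && password === 'secret') {
      return NextResponse.next();
    }
  }

  // Ask the browser for credentials; crawlers receive a 401 and cannot index the content
  return new NextResponse('Authentication required', {
    status: 401,
    headers: { 'WWW-Authenticate': 'Basic realm="Protected"' },
  });
}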

3. Use the wrong "canonical" tag


Place a canonical tag that points to a non-existent or irrelevant page. This can confuse search engines and discourage them from indexing your page.
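For illustration only (this is a hack rather than a recommended practice), the canonical tag could point to an unrelated placeholder URL:

<link rel="canonical" href="https://example.com/unrelated-page" />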

4. Remove URLs in Google Search Console and Bing Webmaster Tools

Use Google's and Bing's webmaster tools to request the removal of specific URLs from search indices.

5. Insert duplicate content

Search engines penalize duplicate content. If your site contains many pages with duplicate content, it is less likely to be prominently indexed.

6. JavaScript or meta refresh redirects


Use JavaScript or meta refresh redirects to redirect users (and search engines) to a different page. This can confuse search engines and reduce the likelihood of indexing.
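A minimal sketch of both variants, with a placeholder target URL:

<meta http-equiv="refresh" content="0; url=https://example.com/" />

// Or via JavaScript:
window.location.replace('https://example.com/');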

7. Block search engine bots

7.1 Block bots by IPs

Configure your server's firewall to block the IPs of major search engine bots. However, this requires constant maintenance, as IPs can change.

You can create an IP set allowlist in AWS WAF, so that IPs not on the list are blocked, which is also a solution to this problem.

7.2 AWS WAF Bot Control


With Bot Control, you can easily monitor, block, or rate limit bots such as scrapers, scanners, crawlers, status monitors, and search engines. If you use the targeted inspection level of the rule group, you can also challenge bots that don't self-identify, making it harder and more expensive for malicious bots to operate against your website. You can protect your applications using the Bot Control managed rule group alone, or in combination with other AWS Managed Rules rule groups and your own custom AWS WAF rules.
