SEO in a CRA project using Lambda@Edge and prerender.io


Recently I came across the problem of SEO on a project using CRA (Create React App). Since projects created with React are SPAs, web crawlers are unable to distinguish between the different paths of your website. Although Googlebot can now render JavaScript, you still want to be visible to other crawlers. Being located in South Korea, we needed to serve the crawlers of local search engines as well. However, we didn't need the entire website to be crawlable: being a B2B SaaS service, we only really needed the outer landing/informative pages to be crawled and indexed.
After contemplating multiple options to achieve this, we decided to use prerendering.
Even within prerendering there were several options: react-snap, prerender.io, etc.
First I tried react-snap, but it seemed to be no longer maintained and had too many open bug reports. It claims to be good to go right away, but there were fixes you had to apply yourself and it didn't work well with my project.
The second option, and the one I ended up going with, was prerender.io with Lambda@Edge.
I was able to implement this method by referencing the following links:
https://github.com/jinty/prerender-cloudfront
https://stackoverflow.com/questions/22383239/single-page-app-amazon-s3-amazon-cloudfront-prerender-io-how-to-set-up
However, the path to a successful implementation wasn't smooth, since at first I wasn't completely sure how every part of this method worked. I'm writing this in the hope that it will help others facing the same problem.
Since our frontend is deployed through AWS, using S3 and CloudFront, we would have had to run a new server if we wanted to follow the method described in the prerender.io docs.
I didn't want an extra server to maintain, so I decided to use Lambda@Edge, which lets you run functions around CloudFront.
Lambda@Edge functions can be triggered at four points around CloudFront: the viewer request, the origin request, the origin response, and the viewer response.
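For reference, this is roughly where the two functions in this post hook in. The fragment below is only a sketch of the LambdaFunctionAssociations part of a CloudFront cache behavior config; the ARNs are placeholders, and note that Lambda@Edge functions must be deployed in us-east-1 and referenced by a published version, not $LATEST.
// Hypothetical fragment of a cache behavior config, showing where the two
// Lambda@Edge functions from this post attach. The ARNs are placeholders.
const lambdaFunctionAssociations = {
    Quantity: 2,
    Items: [
        {
            EventType: 'viewer-request', // runs before CloudFront checks its cache
            LambdaFunctionARN: 'arn:aws:lambda:us-east-1:123456789012:function:prerender-viewer-request:1'
        },
        {
            EventType: 'origin-request', // runs before CloudFront contacts the origin
            LambdaFunctionARN: 'arn:aws:lambda:us-east-1:123456789012:function:prerender-origin-request:1'
        }
    ]
};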
When the website URL gets called (the viewer request), I can check whether the request came from a bot or a regular user.
If the request is from a bot, I add extra headers carrying the prerender.io token and the host name.
We also check what the request is for, and only add the headers if it isn't for something other than a page, like a robots.txt file or a static asset.
This part is important: without it, the live URL test in Google Search Console won't work, because Googlebot also requests the robots.txt file, which shouldn't (can't?) be prerendered.
Another important thing here is the x-prerender-cachebuster header. As the name says, this header exists to bust the cache. Since CloudFront caches what it returns, if you trigger a redirect to prerender.io (for example by running a live test in Google Search Console right after deploying the Lambda@Edge functions), CloudFront will cache the prerendered page and return it to regular users too. To prevent this, you have to create a new cache policy that includes the added headers in the cache key, so that CloudFront keeps separate versions depending on those headers.
The cachebuster also carries the current date as its value, since you can set the recaching interval in prerender.io. If your pages don't change often this matters less, but the changing value makes CloudFront fetch a fresh copy as the date changes instead of serving a stale one.
I personally didn't know this beforehand and wasted a couple of days trying to figure out why prerendered pages were being returned to user requests. Now that I think about it, it's pretty obvious why...
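To make the cache policy part concrete, here is a minimal sketch of creating such a policy with the AWS SDK for JavaScript (v2). The policy name and TTL values are placeholders of mine; the important part is whitelisting the three headers so they become part of the cache key.
// A minimal sketch, assuming the AWS SDK for JavaScript v2 ('aws-sdk').
// The policy name and TTLs are placeholders.
const AWS = require('aws-sdk');
const cloudfront = new AWS.CloudFront();

cloudfront.createCachePolicy({
    CachePolicyConfig: {
        Name: 'prerender-aware-policy', // hypothetical name
        MinTTL: 0,
        DefaultTTL: 86400,
        MaxTTL: 31536000,
        ParametersInCacheKeyAndForwardedToOrigin: {
            EnableAcceptEncodingGzip: true,
            HeadersConfig: {
                HeaderBehavior: 'whitelist',
                Headers: {
                    Quantity: 3,
                    Items: ['X-Prerender-Token', 'X-Prerender-Host', 'X-Prerender-Cachebuster']
                }
            },
            CookiesConfig: { CookieBehavior: 'none' },
            QueryStringsConfig: { QueryStringBehavior: 'none' }
        }
    }
}, (err, data) => {
    if (err) console.error(err);
    else console.log('created cache policy', data.CachePolicy.Id);
});
With the cache policy in place, here is the viewer request function: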
exports.handler = (event, context, callback) => {
    'use strict';
    const request = event.Records[0].cf.request;
    const headers = request.headers;
    const userAgent = headers['user-agent'];
    const host = headers['host'];

    if (userAgent && host) {
      // Skip static assets and other non-page requests (robots.txt, images, etc.),
      // which shouldn't be prerendered.
      if (!request.uri.match(/\.(js|css|xml|less|png|jpg|jpeg|gif|pdf|doc|txt|ico|rss|zip|mp3|rar|exe|wmv|avi|ppt|mpg|mpeg|tif|wav|mov|psd|ai|xls|mp4|m4a|swf|dat|dmg|iso|flv|m4v|torrent|ttf|woff|svg|eot)/i)) {
        // Only add the prerender headers when the request comes from a known crawler.
        if (/baiduspider|Googlebot|Facebot|facebookexternalhit|twitterbot|rogerbot|linkedinbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|vkShare|W3C_Validator/.test(userAgent[0].value)) {
          console.log("added token", host[0].value);
          headers['x-prerender-token'] = [{ key: 'X-Prerender-Token', value: 'your-token-goes-here'}];
          headers['x-prerender-host'] = [{ key: 'X-Prerender-Host', value: host[0].value}];
          // Timestamp value: once whitelisted in the cache policy, it keeps
          // prerendered responses out of the cache entries served to regular users.
          headers['x-prerender-cachebuster'] = [{ key: 'X-Prerender-Cachebuster', value: Date.now().toString()}];
        }
      }
    }
    callback(null, request);
};
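To sanity-check the function before deploying, you can invoke the handler locally with a hand-built CloudFront event. This is just a sketch; the file name, host, and user agent are example values.
// Quick local test with a hand-rolled CloudFront viewer request event.
const { handler } = require('./viewer-request'); // hypothetical file name

const event = {
    Records: [{
        cf: {
            request: {
                uri: '/pricing',
                headers: {
                    'host': [{ key: 'Host', value: 'www.example.com' }],
                    'user-agent': [{ key: 'User-Agent', value: 'Googlebot/2.1 (+http://www.google.com/bot.html)' }]
                }
            }
        }
    }]
};

handler(event, {}, (err, request) => {
    // Expect x-prerender-token, x-prerender-host and x-prerender-cachebuster here.
    console.log(request.headers);
});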
The request then goes through CloudFront and an origin request is created. Here another Lambda@Edge function is called, and depending on whether the extra headers were added, the request either goes to the S3 bucket or gets redirected to prerender.io.
You can verify what prerender.io has stored by calling service.prerender.io directly, for example with Postman; it returns the prerendered pages cached on their side.
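If you prefer code over Postman, a quick check could look like this. It assumes Node 18+ for the global fetch; the token and URL are placeholder values.
// Fetch a prerendered page directly from prerender.io (Node 18+).
fetch('https://service.prerender.io/https://www.example.com/pricing', {
    headers: { 'X-Prerender-Token': 'your-token-goes-here' }
})
    .then((res) => res.text())
    .then((html) => console.log(html)); // the prerendered HTML
The origin request function that does the redirecting looks like this: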
exports.handler = (event, context, callback) => {
    'use strict';
    const request = event.Records[0].cf.request;
    // If the viewer request function marked this as a bot request,
    // swap the origin from the S3 bucket to prerender.io.
    if (request.headers['x-prerender-token'] && request.headers['x-prerender-host']) {
        request.origin = {
            custom: {
                domainName: 'service.prerender.io',
                port: 443,
                protocol: 'https',
                readTimeout: 20,
                keepaliveTimeout: 5,
                customHeaders: {},
                sslProtocols: ['TLSv1', 'TLSv1.1'],
                // CloudFront appends request.uri after this path, so the request
                // ends up at service.prerender.io/https%3A%2F%2F<host><uri>
                path: '/https%3A%2F%2F' + request.headers['x-prerender-host'][0].value
            }
        };
    }
    callback(null, request);
};
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/controlling-origin-requests.html
If your behavior uses cache/origin request policies, you have to configure this so that the headers added in the viewer request are actually forwarded with the origin request; otherwise the origin request function will never see them.
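As a sketch, whitelisting the headers in an origin request policy with the AWS SDK for JavaScript (v2) could look like the following; the policy name is again a placeholder. (Headers whitelisted in the cache policy above are forwarded to the origin automatically, so depending on your setup this may already be covered.)
// Sketch of an origin request policy that forwards the prerender headers,
// assuming the AWS SDK for JavaScript v2. Without the headers being forwarded,
// the origin request function never sees what the viewer request function added.
const AWS = require('aws-sdk');
const cloudfront = new AWS.CloudFront();

cloudfront.createOriginRequestPolicy({
    OriginRequestPolicyConfig: {
        Name: 'forward-prerender-headers', // hypothetical name
        HeadersConfig: {
            HeaderBehavior: 'whitelist',
            Headers: {
                Quantity: 3,
                Items: ['X-Prerender-Token', 'X-Prerender-Host', 'X-Prerender-Cachebuster']
            }
        },
        CookiesConfig: { CookieBehavior: 'none' },
        QueryStringsConfig: { QueryStringBehavior: 'all' }
    }
}, (err, data) => {
    if (err) console.error(err);
    else console.log('created origin request policy', data.OriginRequestPolicy.Id);
});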