Dynamic SEO for a statically served SPA

A Lambda + CloudFront + S3 architecture for SPA blogs, classified listings, job boards, and more!

Hesham Meneisi
10 min read · Jul 19, 2020

Single-page applications (SPAs) are becoming more and more popular every day. They have so many pros that they look almost superior to traditional websites in every way, with very few cons on the other side of the scale.

To name a few:

  • Seamless navigation: The user never has to see a blank page after the website loads for the first time.
  • Client-side caching: No need to call the server for each page visit.
  • Offline accessibility: You have much more control over what happens when the user gets disconnected.
  • Reusable API: Your website, mobile application, and client APIs can all be the same thing!
  • Easy to develop and debug: I had no love for web development until I discovered the world of SPAs and microservices.
  • Can be statically served: Statically serving resources has great throughput and more caching options.

On the flip side:

  • Big client download size: This slows down the initial page load, and potentially again after each new release.
  • CPU hungry: This could lead to bad performance on mobile devices and slow notebooks.
  • Challenging SEO.

I’m not here to delve deeply into the different cons and how to mitigate each of them. Instead, I’m going to jump straight to the one I’ll demonstrate a solution for: SEO.

The Problem

Most crawlers used by even the biggest companies out there don’t evaluate JavaScript. This means that crawlers will think your whole website is just one page, with one title, one description, one thumbnail, one of everything that [should] be unique for each page.

One way around this is pre-rendering. It’s simply a neat trick that places an index.html replica with different SEO tags at each of your routes. This works for websites with a finite number of pages, but if we are talking about dynamic views (search pages, articles, jobs, products, etc.), this solution will not get you anywhere.

The next most traditional solution would be server-side rendering (SSR). This would give you great SEO control but it comes with its set of disadvantages.

  • SSR increases time to first byte (TTFB) because the user has to wait for the rendering to be done by the server before receiving anything.
  • It breaks the seamless experience SPAs are prized for.
  • It’s a CPU-heavy process; your servers have to do the work of all your users’ machines combined!
  • The S in SSR doesn’t stand for serverless.

Some hybrid frameworks try to mitigate the issue by having the best of both worlds (e.g. Gatsby). But it’s up to you to decide whether learning Gatsby and/or converting your app to Gatsby is worth it. I’m not the best person to ask, as I have no experience with it so far.

Common Architecture

…you can get perfect dynamic SEO by just adding a few more milliseconds.

S3 + CloudFront Architecture

There are many articles on how to set up this architecture, so I won’t repeat the steps here.

One nice thing about SPAs when not using SSR is that you can just directly serve them from an S3 bucket like a static website. The most logical step after that would be to put some cache like CloudFront in front of that S3 bucket to optimize delivery and download speeds for users wherever they are.

This is a serverless architecture; there are no servers involved other than the AWS servers handling S3 and the CloudFront cache. From your end, there are no servers whatsoever that you need to configure or maintain. Deployment is as easy as syncing that S3 bucket with a new build. Life is good!

One more really common practice when using this architecture is to disable caching for the index.html page (or limit it to a much smaller TTL) to be able to deploy faster and avoid manual cache invalidation.

This is completely fine since most of the time spent loading your application goes to the various scripts and style sheets, not the tiny HTML page. The only cost is a few milliseconds potentially added to the load time if the user is far from the bucket’s region.

The reason I mentioned this is that you need to grasp the concept that it’s possible to single out specific files and treat them differently.
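For example, here’s a minimal sketch of uploading files with different cache policies using boto3 (the bucket name and build paths are placeholders):

import boto3

s3 = boto3.client("s3")

# index.html: short TTL so new deployments show up quickly
s3.upload_file(
    "build/index.html", "your.bucket.name", "index.html",
    ExtraArgs={"ContentType": "text/html",
               "CacheControl": "public, max-age=60"})

# Fingerprinted bundle: the name changes every release, so cache it aggressively
s3.upload_file(
    "build/main.abc123.js", "your.bucket.name", "main.abc123.js",
    ExtraArgs={"ContentType": "application/javascript",
               "CacheControl": "public, max-age=31536000"})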

Lambda@Edge

A Lambda can be added at any of the four interchange points in this architecture: client to CF (Viewer request), CF to the client (Viewer response), CF to S3 (Origin request), or S3 to CF (Origin response). All of these run at the CloudFront edge locations, hence the “@Edge” part.
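To make this concrete, here is a trimmed-down sketch of what a viewer-request event looks like to the Lambda (the values are illustrative, and the real event carries more fields):

event = {
    "Records": [{
        "cf": {
            "config": {"eventType": "viewer-request"},
            "request": {
                "method": "GET",
                "uri": "/article/my-first-post",
                "headers": {
                    "user-agent": [{"key": "User-Agent", "value": "Googlebot/2.1"}]
                }
            }
        }
    }]
}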

Existing Solution

Just by adding a Lambda either between the user and CloudFront or between CloudFront and S3 (depending on whether or not you want your index.html and SEO tags to be cached), you can respond with custom HTML!

Some people have already come up with the idea of using a Lambda to run a Chrome browser, evaluate the JS, and get the right meta tags. You do the crawlers a service by evaluating that JS for them, and they reward you with traffic! Seems like a fair trade, but there are a few issues with this approach.

  • It uses A LOT of memory (0.5GB!), which adds to the Lambda cost.
  • It’s slow. You have to spin up a Chromium instance and visit the page, and you have to load the page twice if it is not cached!
  • Being slow means you will likely have to pay for additional concurrency quota for this Lambda if many users visit at the same time.
  • Your app has to be updated and redeployed if you change your SEO logic.
  • You have to specifically target crawlers to make it more efficient.
  • You have to keep a list of crawler user agents up to date. Even then, you won’t cover all crawlers from all apps.
  • Due to the previous points, this is unlikely to be usable on the viewer side and would not scale, so you are forced to cache your HTML.

But if you care about having a minimal delay, instant SEO availability, instant deployment without cache invalidation, low cost, and low maintenance, read on!

Improvements

You can just inject whatever meta you want into the HTML in-flight!

I have used this solution in production myself, so it’s not hypothetical. However, I haven’t seen anyone else document it during my research to solve the problem, which is why I decided to write this article.

If you noticed, most of the issues with the previous solution can be traced back to using Chromium to load the page. So, what if we get rid of Chrome?

Lambda is capable of both retrieving S3 objects and requesting data from your backend; so you don’t need to load the page at all. You can just inject whatever meta you want into the HTML in-flight!

Let’s get to the code!

Language Selection

First, I decided to write this Lambda in Python 3.8 for the following reasons:

  1. Python is very fast and efficient on Lambda. It has a very low memory footprint and a rapid cold start.
  2. When migrating from AWS to other providers, Python is among the most widely supported languages for serverless functions.

Unaffected routes take 1–2 ms to process! The worst-case scenario, where you are retrieving the object from an S3 bucket across the world, takes around 1 second. Even then, there’s a way to reduce this time that I will discuss later on.

Handling Requests

Lambda@Edge supports modifying the request, skipping it, or returning your own response.

In case you didn’t know, Viewer Response and Origin Response Lambdas don’t have access to the response body, so we intercept the request instead and generate the response ourselves. This also means there’s no need to waste time waiting for the default body to be returned before executing our Lambda.

The function to handle requests should look something like this:

import re
from fetch_html import fetch_html
from transform_state import transform_state
from rules import rules
from helpers import structure_headers

def lambda_handler(event, context):
    request = event["Records"][0]["cf"]["request"]

    # Static assets go straight through to the origin
    if re.match(r".*\.(js|css|jpg|png|svg)$", request["uri"]):
        return request

    for rule in rules:
        if not re.match(rule["pattern"], request["uri"]):
            continue
        # A rule matched: fetch index.html and transform it in-flight
        state = fetch_html("your.bucket.name")
        state["response_headers"] = {"content-type": "text/html"}
        status = state["status"]
        if 200 <= status < 400:
            transform_state(state, rule)
        return {
            "status": state["status"],
            "body": state["html"],
            "headers": structure_headers(state["response_headers"])
        }

    return request

That’s it! This handler returns the request as-is if it matches the exclusion pattern or if none of the routes we are looking for match. If one of the routes matches, however, it returns a response! As mentioned, Lambda@Edge supports modifying the request, skipping it, or returning your custom response.

This means that you will barely affect the retrieval of raw files and other routes, as long as your Lambda’s cold start time is insignificant. And that’s why Python is the perfect candidate.

Now, you must have noticed that I’m using two functions there: fetch_html and transform_state. I will discuss how to implement these in the following sections; they are just a few lines each.

The header formatting function is a straightforward conversion to Lambda header format:

def structure_headers(headers):
    # {"content-type": "text/html"} ->
    # {"content-type": [{"key": "content-type", "value": "text/html"}]}
    for key, value in headers.items():
        headers[key] = [{"key": key, "value": value}]
    return headers

Fetching the HTML

It’s quite easy to retrieve the HTML from S3 in our Lambda. But keep in mind that for the code below to work, your Lambda’s IAM role must be assigned an S3 access policy that covers this bucket.

import boto3

s3_client = boto3.client('s3')

def fetch_html(bucket, index_key="index.html"):
    response = s3_client.get_object(Bucket=bucket, Key=index_key)
    return {
        "html": response["Body"].read().decode('utf-8'),
        "headers": response["ResponseMetadata"]["HTTPHeaders"],
        "status": response["ResponseMetadata"]["HTTPStatusCode"]
    }

Transformer Functions

The next step would be writing a few functions that would make it easy to inject and/or replace SEO tags.

import re

def replace_meta_value(state, id_attrib, id_value, new_value):
    # Replace the content of an existing meta tag identified by id_attrib="id_value"
    state["html"] = re.sub(
        f'<meta {id_attrib}=["\']*{id_value}["\']*[^>]+>',
        f'<meta {id_attrib}="{id_value}" content="{new_value}">',
        state["html"])

def inject_meta_field(state, field):
    # Insert a brand new tag right after <head>
    state["html"] = re.sub('<head>', f'<head>{field}', state["html"])

def replace_tag(state, tag, replacement):
    # This regex covers only basic tags with no attributes (e.g. <title>Title</title>)
    state["html"] = re.sub(f"<{tag}>[^<]*</{tag}>", replacement, state["html"])

Now if you would like to use something like this to replace pre-rendering, you’d only need to parse a JSON file in your project that looks something like this:

[{
    "pattern": "/about",
    "transformations": [{
        "name": "inject_meta_field",
        "params": {"field": "<meta name=\"description\" content=\"This is the about section!\"/>"}
    }, {
        "name": "replace_meta_value",
        "params": {"id_attrib": "property",
                   "id_value": "og:title",
                   "new_value": "About"}
    }]
}, {
    "pattern": "/services",
    "transformations": [{
        "name": "replace_meta_value",
        "params": {"id_attrib": "property",
                   "id_value": "og:title",
                   "new_value": "Services"}
    }]
}]

Alternatively, you can keep that mapping readily available in your Lambda code. The dictionary would be instantly available without accessing S3 again, and you can put it in a Python file to skip the JSON parsing as well! You would also be able to change SEO rules without having to redeploy your app; just update the file and redeploy the Lambda.
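A minimal rules.py along those lines, mirroring the JSON above, could be as simple as:

# rules.py -- imported by the handler directly, no JSON parsing needed
rules = [
    {
        "pattern": "/about",
        "transformations": [
            {"name": "replace_meta_value",
             "params": {"id_attrib": "property",
                        "id_value": "og:title",
                        "new_value": "About"}},
        ],
    },
]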

Transforming HTML

It’s easy to map that JSON/dictionary to your transformer functions like so:

import transformers

def transform_state(state, rule):
    for t in rule["transformations"]:
        apply_trans(t, state)

def apply_trans(t, state):
    # Look up the transformer by name and call it with its params
    t_func = getattr(transformers, t["name"])
    t_func(state, **t["params"])

By including the whole state in each transformation, we also allow those functions to modify the response headers and/or status code.
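For instance, a hypothetical transformer (not part of the code above) could turn a retired route into a redirect without touching the HTML at all:

def set_redirect(state, location):
    # Overwrite the status and point crawlers (and users) elsewhere
    state["status"] = 301
    state["response_headers"]["location"] = location

It would be triggered from a rule entry such as {"name": "set_redirect", "params": {"location": "/new-path"}}.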

Dynamic SEO

Now you might be wondering: if this just achieves exactly what pre-rendering does, why go through the trouble? Well, there are two reasons to do so:

  1. You don’t want to implement pre-rendering! (there’s a loading-time impact, though)
  2. You want dynamic SEO for search pages, product pages, articles, etc.

For example, this is a simple API call to a public endpoint:

import json
from urllib.request import urlopen

def get_article(slug):
    url = f"https://api.example.com/api/v1/articles/{slug}"
    response = urlopen(url)
    return json.loads(response.read())
Then you can inject whatever attributes you want from that object. You can even define this as one of the transformer functions and just place it in the transformations of the right route pattern.

def inject_article_info(state):
    article = get_article(slug)
    # replace_meta_value mutates state in place, no assignment needed
    replace_meta_value(state, "property", "og:title", article["title"])
    ...

Although, for that to work as a transformer function, you will either have to define the slug globally (not clean) or simply include the request in your state.

...
state = fetch_html("your.bucket.name")
state["request"] = request
...

Then use it in your transformer:

def inject_article_info(state):
    # Extract the slug from the request URI carried in the state
    slug = re.search(r"/article/([a-zA-Z0-9-]+)", state["request"]["uri"]).group(1)
    article = get_article(slug)
    replace_meta_value(state, "property", "og:title", article["title"])
    ...

Further Improvements

Even though this solution uses far fewer resources, the impact on TTFB will still be felt when using it on Viewer Request. One way I could think of to mitigate that is to cache the HTML for the lifetime of the execution environment. Global variables in a Lambda survive between invocations as long as the environment isn’t destroyed, so they can act as a cache when multiple requests arrive in succession.

import boto3
from time import time
from copy import deepcopy

s3_client = boto3.client('s3')

cached_state = None
cache_timestamp = None
cache_ttl = 60  # seconds

def fetch_html(bucket, index_key="index.html"):
    global cached_state
    global cache_timestamp
    now = time()
    # Serve from the warm environment's cache while it's fresh
    if cached_state and now - cache_timestamp < cache_ttl:
        return deepcopy(cached_state)
    response = s3_client.get_object(Bucket=bucket, Key=index_key)
    state = {
        "html": response["Body"].read().decode('utf-8'),
        "headers": response["ResponseMetadata"]["HTTPHeaders"],
        "status": response["ResponseMetadata"]["HTTPStatusCode"]
    }
    cached_state = deepcopy(state)
    cache_timestamp = now
    return state

I’m sure there’s a way to avoid rebuilding the whole HTML string with every regex replacement. This can become costly with a rather big file or a lot of transformations. One way I can think of to mitigate this issue is recording the regex match start/end offsets in an html_changes key of the state, then running a single pass that applies all replacements/deletions at once.
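Here is a rough sketch of that final pass, assuming transformers record non-overlapping (start, end, replacement) spans instead of calling re.sub directly:

def apply_changes(html, changes):
    # changes: list of (start, end, replacement) spans recorded by transformers
    pieces, cursor = [], 0
    for start, end, replacement in sorted(changes):
        pieces.append(html[cursor:start])  # untouched text before this span
        pieces.append(replacement)
        cursor = end
    pieces.append(html[cursor:])  # remainder after the last span
    return "".join(pieces)

# apply_changes("<title>App</title>", [(7, 10, "About")]) -> "<title>About</title>"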

Final Words

I started a GitHub repository with the base code discussed here. Contributors are welcome to add generic HTML transformations or suggest improvements. There’s also room for performance optimization and adding unit tests.

I will need one maintainer to collaborate on this project so if you are interested, please contact me.

If you liked this article, make sure to follow to get more of the same. I will also try my best to respond to comments seeking help. Happy coding!
