Marketing the TNW Way #5: SEO, running Screaming Frog in the cloud

In this series of articles, I’d love to shed some light on how The Next Web approaches marketing through Web analytics, Search Engine Optimization (SEO), Conversion Rate Optimization (CRO), social media and more. This time around, we’re focusing on SEO and how we analyze and audit a website the size of The Next Web.

Why do we run SEO audits?

We publish 30-40 articles a day and try to improve the site as best we can. Analyzing all the pages we have online is an increasing challenge, and we can’t always abide by the guidelines and best practices of our big friend, Google.

Before we dive into the data, analyze problems and get the website into the best shape we can, we need to know what is happening to the site in real time. Therefore we need to crawl everything.

The main purposes of these audits are:

Accessibility and Indexability

Is our website accessible for search engines, or are there pages we don’t want to be found? If search engines can’t find a page, nobody will.

Technical errors

There are many factors search engines use to rank result pages. By crawling all pages, from domain to page level, you can spot possibilities to optimize for SEO more quickly.

After the crawl you still need to do the most important work: analyzing the data and turning it into an actionable plan. But let’s start at the beginning, the setup.

Why Screaming Frog?

Site audits can be very time-consuming. The biggest challenge we came across was that we wanted to have all URLs visible, but that was a much bigger task than we expected. However, we found Screaming Frog SEO Spider was more than up for the task.

It’s an all-round tool that lets you find broken links, check for Google Analytics (or any other) code on all pages, monitor all the redirects and trace the redirect paths in a website. But it also has its limitations.

The number of URLs you are able to crawl is directly linked to the RAM capacity of your computer. My laptop only has 4GB of RAM, which meant I wasn’t able to crawl the site in its entirety. Things ground to a halt around the 40,000 URL mark, but the site actually has nearly 200,000. So we had a problem.
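
To get a feel for the gap, here’s a back-of-the-envelope sketch. It assumes memory use grows roughly linearly with the number of URLs crawled, which is my simplification and not something Screaming Frog promises, so treat the result as a lower bound rather than a sizing rule. The function name `estimate_ram_gb` is made up for this example.

```shell
#!/bin/sh
# Rough estimate only: ASSUMES memory use grows linearly with the URL count.
KNOWN_URLS=40000   # where the crawl stalled on the laptop (from the article)
KNOWN_RAM_GB=4     # RAM in that laptop (from the article)

# estimate_ram_gb URLS
# Prints the estimated RAM in GB for a crawl of URLS pages, rounded up.
estimate_ram_gb() {
  urls=$1
  echo $(( (urls * KNOWN_RAM_GB + KNOWN_URLS - 1) / KNOWN_URLS ))
}

# The full site has nearly 200,000 URLs.
estimate_ram_gb 200000
```

Under that assumption the full site needs far more memory than a 4GB laptop can offer, so some headroom on top of the estimate is sensible.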

We looked into alternatives that would keep costs down while maintaining the flexibility we had before. We considered a few cloud tools, but since we want to work with the data easily and still have access to the raw data later, we decided that Screaming Frog was still the way to go.

That left the problem of not being able to crawl more URLs.

Yes, it’s possible to get yourself a monster of a computer, but we decided to go to the Google Cloud Platform and run it virtually. The biggest source of information we used for it was this article (thanks for that @FiliWiese).

What is the next step that we’d like to take to get more data on our SEO performance?

We’re going to make sure that we can leverage the data from Screaming Frog even better so we can monitor our changes and performance over time. Hopefully in six months we’ll be able to blog more about this. We’re also looking into making this process more automated to save time and costs for running the server.
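
As a first idea of what that automation might look like, here’s a minimal sketch that starts the VM before a crawl and stops it afterwards, so the machine only runs when it’s needed. The instance name and zone are hypothetical placeholders, and the script only prints the `gcloud` commands unless you set `DRY_RUN=0`. It isn’t what we run in production.

```shell
#!/bin/sh
# Sketch: keep the VM running only while a crawl is in progress.
# INSTANCE and ZONE are hypothetical placeholders; replace them with your own.
INSTANCE="${INSTANCE:-screaming-frog}"
ZONE="${ZONE:-europe-west1-b}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

start_vm() { run gcloud compute instances start "$INSTANCE" --zone "$ZONE"; }
stop_vm() { run gcloud compute instances stop "$INSTANCE" --zone "$ZONE"; }

start_vm
# ... run the crawl over VNC and export the results here ...
stop_vm
```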

How can you run this?

You don’t need to be a complete geek to get how this works. So here’s the complete setup for running Screaming Frog in the cloud:

  1. On your local machine, make sure you have installed the Google Cloud SDK (you’ll need it in order to create and set up a Compute Engine instance).
  2. Create a new project in the Google Developers Console. You’ll need the keys there later.
  3. Once you’ve created the project, make sure to enable the APIs related to Google Cloud. You’ll then have permission to create an instance and connect to it.
  4. Create a new instance.
    • Machine type: In our case we chose 8 vCPUs (30GB memory) to make sure we could “speed things up”.
    • Boot disk: You probably want to allocate more disk space than the default 10GB. We changed it to 50GB for now.
    • When you create the instance you’ll get a notification that you’ll be billed. You only pay for the hours the machine is running, so you can stop it at any time.
    The setup from this point on is mostly done in Terminal; the commands are shown below each step.
  5. Install a VNC server on your VM instance.
    • Open Terminal
    • Log in to Google Developers Console
      gcloud auth login
    • Verify your account
    • Switch to the root user in Terminal
      sudo -s
    • Update the software packages
      apt-get update
    • Install necessary programs
      apt-get install tightvncserver xfce4 xfce4-goodies xdg-utils openjdk-6-jre software-properties-common python-software-properties
    • Add user
      adduser vnc
    • Switch to new user
      su vnc
    • Set new password
  6. Set up startup scripts so the VNC server will turn on once you turn on the VM
    • Download the scripts
      wget -O /etc/init.d/vncserver
      wget -O /home/vnc/.vnc/xstartup
    • Apply configuration settings
      chown -R vnc. /home/vnc/.vnc && chmod +x /home/vnc/.vnc/xstartup
      sed -i 's/allowed_users.*/allowed_users=anybody/g' /etc/X11/Xwrapper.config
      chmod +x /etc/init.d/vncserver
    • Reboot
    • Start VNC server
      update-rc.d vncserver defaults
      service vncserver start
  7. Install Screaming Frog on your VM instance
    • Download Java
      echo "deb trusty main" | tee /etc/apt/sources.list.d/webupd8team-java.list
      echo "deb-src trusty main" | tee -a /etc/apt/sources.list.d/webupd8team-java.list
      apt-key adv --keyserver hkp:// --recv-keys EEA14886
      apt-get update
      apt-get install oracle-java8-installer
    • Set Oracle Java as default
      apt-get install oracle-java8-set-default
    • Before installing Screaming Frog, add the “ttf-mscorefonts-installer”
      add-apt-repository "deb wheezy main contrib non-free" && apt-get update && apt-get install ttf-mscorefonts-installer
    • Download the latest version from Screaming Frog
    • Install Screaming Frog
      dpkg -i screamingfrogseospider_6.2_all.deb
    • If an error comes up type in this command
      apt-get -f install
  8. In order to connect to the VNC you need a VNC client. There are several online but we are using RealVNC.
  9. Don’t forget to open a port on your instance. Your VNC client is going to need it. You can easily do this via the Developers console where you change your instance. Go to Networks and create a new firewall rule where you enter these allowed protocols/ports: “tcp:5900; tcp:5901; tcp:5902”. This opens up the port for your VNC client to connect.
  10. Open the VNC client, enter the IP address of your instance, add one of the open ports to it so it will look like this: “” and smash the Connect button. You should now have a running Linux desktop in front of you, great!
  11. Run a crawl in Screaming Frog. If you want to export it to Google Cloud Storage, first export a CSV file of the crawl data, then copy it to your own bucket with these commands in your VNC session:
      cd /home/vnc/Desktop
      gsutil cp internal_all.csv gs://{your-bucket-name}
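
If you’d rather not click through the console, steps 4 and 9 can probably be done from your local Terminal as well. Here’s a rough sketch under a few assumptions of mine: the instance name, zone and firewall rule name are hypothetical, and `n1-standard-8` is my guess at the machine type that matches 8 vCPUs with 30GB of memory. By default it only prints the `gcloud` commands; set `DRY_RUN=0` to actually run them.

```shell
#!/bin/sh
# Sketch of steps 4 and 9 from the command line (run on your local machine).
# INSTANCE, ZONE and RULE are hypothetical placeholders; replace them with your own.
INSTANCE="${INSTANCE:-screaming-frog}"
ZONE="${ZONE:-europe-west1-b}"
RULE="${RULE:-allow-vnc}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 4: 8 vCPUs / 30GB memory (assumed to be n1-standard-8), 50GB boot disk.
create_instance() {
  run gcloud compute instances create "$INSTANCE" \
    --zone "$ZONE" \
    --machine-type n1-standard-8 \
    --boot-disk-size 50GB
}

# Step 9: open the three VNC ports named in the article.
open_vnc_ports() {
  run gcloud compute firewall-rules create "$RULE" \
    --allow tcp:5900,tcp:5901,tcp:5902
}

create_instance
open_vnc_ports
```

Keeping the dry-run default means you can read the exact commands before anything billable gets created.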

Hope you can benefit from these steps and tips.
How are you trying to make your SEO process more innovative and efficient?

If you missed the previous posts in this series, don’t forget to check them out: #1: Heat maps, #2: Deep dive on A/B testing, #3: Learnings from our A/B tests and #4: From Manager to Recruiter.


This is a #TNWLife article, a look into life and work at The Next Web.
