Journal tags: apache

6

Crawlers

A few months back, I wrote about how Google is breaking its social contract with the web, harvesting our content not in order to send search traffic to relevant results, but to feed a large language model that will spew auto-completed sentences instead.

I still think Chris put it best:

I just think it’s fuckin’ rude.

When it comes to the crawlers that are ingesting our words to feed large language models, Neil Clarke describes the situtation:

It should be strictly opt-in. No one should be required to provide their work for free to any person or organization. The online community is under no responsibility to help them create their products. Some will declare that I am “Anti-AI” for saying such things, but that would be a misrepresentation. I am not declaring that these systems should be torn down, simply that their developers aren’t entitled to our work. They can still build those systems with purchased or donated data.

Alas, the current situation is opt-out. The onus is on us to update our robots.txt file.

Neil handily provides the current list to add to your file. Pass it on:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: FacebookBot
Disallow: /

In theory you should be able to group those user agents together, but citation needed on whether that’s honoured everywhere:

User-agent: CCBot
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: Google-Extended
User-agent: Omgilibot
User-agent: FacebookBot
Disallow: /

There’s a bigger issue with robots.txt though. It too is a social contract. And as we’ve seen, when it comes to large language models, social contracts are being ripped up by the companies looking to feed their beasts.

As Jim says:

I realized why I hadn’t yet added any rules to my robots.txt: I have zero faith in it.

That realisation was prompted in part by Manuel Moreale’s experiment with blocking crawlers:

So, what’s the takeaway here? I guess that the vast majority of crawlers don’t give a shit about your robots.txt.

Time to up the ante. Neil’s post offers an option if you’re running Apache. Either in .htaccess or in a .conf file, you can block user agents using mod_rewrite:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|Omgilibot| FacebookBot) [NC]
RewriteRule ^ – [F]

You’ll see that Google-Extended isn’t that list. It isn’t a crawler. Rather it’s the permissions model that Google have implemented for using your site’s content to train large language models: unless you opt out via robots.txt, it’s assumed that you’re totally fine with your content being used to feed their stochastic parrots.

Certbot renewals with Apache

I wrote a while back about switching to HTTPS on Apache 2.4.7 on Ubuntu 14.04 on Digital Ocean. In that post, I pointed to an example .conf file.

I’ve been having a few issues with my certificate renewals with Certbot (the artist formerly known as Let’s Encrypt). If I did a dry-run for renewing my certificates…

/etc/certbot-auto renew --dry-run

… I kept getting this message:

Encountered vhost ambiguity but unable to ask for user guidance in non-interactive mode. Currently Certbot needs each vhost to be in its own conf file, and may need vhosts to be explicitly labelled with ServerName or ServerAlias directories. Falling back to default vhost *:443…

It turns out that Certbot doesn’t like HTTP and HTTPS configurations being lumped into one .conf file. Instead it expects to see all the port 80 stuff in a domain.com.conf file, and the port 443 stuff in a domain.com-ssl.conf file.

So I’ve taken that original .conf file and split it up into two.

First I SSH’d into my server and went to the Apache directory where all these .conf files live:

cd /etc/apache2/sites-available

Then I copied the current (single) file to make the SSL version:

cp yourdomain.com.conf yourdomain.com-ssl.conf

Time to fire up one of those weird text editors to edit that newly-created file:

nano yourdomain.com-ssl.conf

I deleted everything related to port 80—all the stuff between (and including) the VirtualHost *:80 tags:

<VirtualHost *:80>
...
</VirtualHost>

Hit ctrl and o, press enter in response to the prompt, and then hit ctrl and x.

Now I do the opposite for the original file:

nano yourdomain.com.conf

Delete everything related to VirtualHost *:443:

<VirtualHost *:443>
...
</VirtualHost>

Once again, I hit ctrl and o, press enter in response to the prompt, and then hit ctrl and x.

Now I need to tell Apache about the new .conf file:

a2ensite yourdomain.com-ssl.conf

I’m told that’s cool and all, but that I need to restart Apache for the changes to take effect:

service apache2 restart

Now when I test the certificate renewing process…

/etc/certbot-auto renew --dry-run

…everything goes according to plan.

Switching to HTTPS on Apache 2.4.7 on Ubuntu 14.04 on Digital Ocean

I’ve been updating my book sites over to HTTPS:

They’re all hosted on the same (virtual) box as adactio.com—Ubuntu 14.04 running Apache 2.4.7 on Digital Ocean. If you’ve got a similar configuration, this might be useful for you.

First off, I’m using Let’s Encrypt. Except I’m not. It’s called Certbot now (I’m not entirely sure why).

I installed the Let’s Encertbot client with this incantation (which, like everything else here, will need root-level access so if none of these work, retry using sudo in front of the commands):

wget https://dl.eff.org/certbot-auto
chmod a+x certbot-auto

Seems like a good idea to put that certbot-auto thingy into a directory like /etc:

mv certbot-auto /etc

Rather than have Certbot generate conf files for me, I’m just going to have it generate the certificates. Here’s how I’d generate a certificate for yourdomain.com:

/etc/certbot-auto --apache certonly -d yourdomain.com

The first time you do this, it’ll need to fetch a bunch of dependencies and it’ll ask you for an email address for future reference (should anything ever go screwy). For subsequent domains, the process will be much quicker.

The result of this will be a bunch of generated certificates that live here:

/etc/letsencrypt/live/yourdomain.com/cert.pem
/etc/letsencrypt/live/yourdomain.com/chain.pem
/etc/letsencrypt/live/yourdomain.com/privkey.pem
/etc/letsencrypt/live/yourdomain.com/fullchain.pem

Now you’ll need to configure your Apache gubbins. Head on over to…

cd /etc/apache2/sites-available

If you only have one domain on your server, you can just edit default.ssl.conf. I prefer to have separate conf files for each domain.

Time to fire up an incomprehensible text editor.

nano yourdomain.com.conf

There’s a great SSL Configuration Generator from Mozilla to help you figure out what to put in this file. Following the suggested configuration for my server (assuming I want maximum backward-compatibility), here’s what I put in.

Make sure you update the /path/to/yourdomain.com part—you probably want a directory somewhere in /var/www or wherever your website’s files are sitting.

To exit the infernal text editor, hit ctrl and o, press enter in response to the prompt, and then hit ctrl and x.

If the yourdomain.com.conf didn’t previously exist, you’ll need to enable the configuration by running:

a2ensite yourdomain.com

Time to restart Apache. Fingers crossed…

service apache2 restart

If that worked, you should be able to go to https://yourdomain.com and see a lovely shiny padlock in the address bar.

Assuming that worked, everything is awesome! …for 90 days. After that, your certificates will expire and you’ll be left with a broken website.

Not to worry. You can update your certificates at any time. Test for yourself by doing a dry run:

/etc/certbot-auto renew --dry-run

You should see a message saying:

Processing /etc/letsencrypt/renewal/yourdomain.com.conf

And then, after a while:

** DRY RUN: simulating 'certbot renew' close to cert expiry
** (The test certificates below have not been saved.)
Congratulations, all renewals succeeded.

You could set yourself a calendar reminder to do the renewal (without the --dry-run bit) every few months. Or you could tell your server’s computer to do it by using a cron job. It’s not nearly as rude as it sounds.

You can fire up and edit your list of cron tasks with this command:

crontab -e

This tells the machine to run the renewal task at quarter past six every evening and log any results:

15 18 * * * /etc/certbot-auto renew --quiet >> /var/log/certbot-renew.log

(Don’t worry: it won’t actually generate new certificates unless the current ones are getting close to expiration.) Leave the cronrab editor by doing the ctrl o, enter, ctrl x dance.

Hopefully, there’s nothing more for you to do. I say “hopefully” because I won’t know for sure myself for another 90 days, at which point I’ll find out whether anything’s on fire.

If you have other domains you want to secure, repeat the process by running:

/etc/certbot-auto --apache certonly -d yourotherdomain.com

And then creating/editing /etc/apache2/sites-available/yourotherdomain.com.conf accordingly.

I found these useful when I was going through this process:

That last one is good if you like the warm glow of accomplishment that comes with getting a good grade:

For extra credit, you can run your site through securityheaders.io to harden your headers. Again, not as rude as it sounds.

You know, I probably should have said this at the start of this post, but I should clarify that any advice I’ve given here should be taken with a huge pinch of salt—I have little to no idea what I’m doing. I’m not responsible for any flame-bursting-into that may occur. It’s probably a good idea to back everything up before even starting to do this.

Yeah, I definitely should’ve mentioned that at the start.

Homebrew header hardening

I’m at Homebrew Website Club. I figured I’d use this time to document some tweaking I’ve been doing to the back end of my website.

securityheaders.io is a handy site for testing whether your website’s server is sending sensible headers. Think of it like SSL Test for a few nitty-gritty details.

adactio.com was initially scoring very low, but the accompanying guide to hardening your HTTP headers meant I was able to increase my ranking to acceptable level.

My site is running on an Apache server on an Ubuntu virtual machine on Digital Ocean. If you’ve got a similar set-up, this might be useful…

I ssh’d into my server and went to this folder in the Apache directory

cd /etc/apache2/sites-available

There’s a file called default-ssl.conf that I need to edit (my site is being served up over HTTPS; if your site isn’t, you should edit 000-default.conf instead). I type:

nano default-ssl.conf

Depending on your permissions, you might need to type:

sudo nano default-ssl.conf

Now I’m inside nano. It’s like any other text editor you might be used to using, if you imagined what it would be like to remove all the useful features from it.

Within the <Directory /var/www/> block, I add a few new lines:

<IfModule mod_headers.c>
  Header always set X-Xss-Protection "1; mode=block"
  Header always set X-Frame-Options "SAMEORIGIN"
  Header always set X-Content-Type-Options "nosniff"
</IfModule>

Those are all no-brainers:

Enable protection against cross-site-scripting.
Don’t allow your site to be put inside a frame.
Don’t allow anyone to change the content-type headers of your files after they’ve been sent from the server.

If you’re serving your site over HTTPS, and you’re confident that you don’t have any mixed content (a mixture of HTTPS and HTTP), you can add this line as well:

Header always set Content-Security-Policy "default-src https: data: 'unsafe-inline' 'unsafe-eval'"

To really up your paranoia (and let’s face it, that’s what security is all about; justified paranoia), you can throw this in too:

Header unset Server
Header unset X-Powered-By

That means that your server will no longer broadcast its intimate details. Of course, I’ve completely reversed that benefit by revealing to you in this blog post that my site is running on Apache on Ubuntu.

I’ll tell you something else too: it’s powered by PHP. There’s some editing I did there too. But before I get to that, let’s just finish up that .conf file…

Hit ctrl and o, then press enter. That writes out the file you’ve edited. Now you can leave nano: press ctrl and x.

You’ll need to restart Apache for those changes to take effect. Type:

service apache2 restart

Or, if permission is denied:

sudo service apache2 restart

Now, about that PHP thing. Head over to a different directory:

cd /etc/php5/fpm

Time to edit the php.ini file. Type:

nano php.ini

Or, if you need more permissions:

sudo nano php.ini

It’s a long file, but you’re really only interested in one line. A shortcut to finding that line is to hit ctrl and w (for “where is?”), type expose, and hit enter. That will take you to the right paragraph. If you see a line that says:

expose_php = On

Change it to:

expose_php= Off

Save the file (ctrl and o, enter) then exit nano (ctrl and x).

Restart Apache:

service apache2 restart

Again, you might need to preface that with sudo.

Alright, head on back to securityheaders.io and see how your site is doing now. You should be seeing a much better score.

There’s one more thing I should be doing that’s preventing me from getting a perfect score. That’s Public Key Pinning. It sounds a bit too scary for a mere mortal like me to attempt. Or rather, the consequences of getting it wrong (which I probably would), sound too scary.

Inlining critical CSS for first-time visits

After listening to Scott rave on about how much of a perceived-performance benefit he got from inlining critical CSS on first load, I thought I’d give it a shot over at The Session. On the chance that this might be useful for others, I figured I’d document what I did.

The idea here is that you can give a massive boost to the perceived performance of the first page load on a site by putting the most important CSS in the head of the page. Then you cache the full stylesheet. For subsequent visits you only ever use the external stylesheet. So if you’re squeamish at the thought of munging your CSS into your HTML (and that’s a perfectly reasonable reaction), don’t worry—this is a temporary workaround just for initial visits.

My particular technology stack here is using Grunt, Apache, and PHP with Twig templates. But I’m sure you can adapt this for other technology stacks: what’s important here isn’t the technology, it’s the thinking behind it. And anyway, the end user never sees any of those technologies: the end user gets HTML, CSS, and JavaScript. As long as that’s what you’re outputting, the specifics of the technology stack really don’t matter.

Generating the critical CSS

Okay. First question: how do you figure out which CSS is critical and which CSS can be deferred?

To help answer that, and automate the task of generating the critical CSS, Filament Group have made a Grunt task called grunt-criticalcss. I added that to my project and updated my Gruntfile accordingly:

grunt.initConfig({
    // All my existing Grunt configuration goes here.
    criticalcss: {
        dist: {
            options: {
                url: 'http://thesession.dev',
                width: 1024,
                height: 800,
                filename: '/path/to/main.css',
                outputfile: '/path/to/critical.css'
            }
        }
    }
});

I’m giving it the name of my locally-hosted version of the site and some parameters to judge which CSS to prioritise. Those parameters are viewport width and height. Now, that’s not a perfect way of judging which CSS matters most, but it’ll do.

Then I add it to the list of Grunt tasks:

// All my existing Grunt tasks go here.
grunt.loadNpmTasks('grunt-criticalcss');

grunt.registerTask('default', ['sass', etc., 'criticalcss']);

The end result is that I’ve got two CSS files: the full stylesheet (called something like main.css) and a stylesheet that only contains the critical styles (called critical.css).

Cache-busting CSS

Okay, this is a bit of a tangent but trust me, it’s going to be relevant…

Most of the time it’s a very good thing that browsers cache external CSS files. But if you’ve made a change to that CSS file, then that feature becomes a bug: you need some way of telling the browser that the CSS file has been updated. The simplest way to do this is to change the name of the file so that the browser sees it as a whole new asset to be cached.

You could use query strings to do this cache-busting but that has some issues. I use a little bit of Apache rewriting to get a similar effect. I point browsers to CSS files like this:

<link rel="stylesheet" href="/css/main.20150310.css">

Now, there isn’t actually a file named main.20150310.css, it’s just called main.css. To tell the server where the actual file is, I use this rewrite rule:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.+).(d+).(js|css)$ $1.$3 [L]

That tells the server to ignore those numbers in JavaScript and CSS file names, but the browser will still interpret it as a new file whenever I update that number. You can do that in a .htaccess file or directly in the Apache configuration.

Right. With that little detour out of the way, let’s get back to the issue of inlining critical CSS.

Differentiating repeat visits

That number that I’m putting into the filenames of my CSS is something I update in my Twig template, like this (although this is really something that a Grunt task could do, I guess):

{% set cssupdate = '20150310' %}

Then I can use it like this:

<link rel="stylesheet" href="/css/main.{{ cssupdate }}.css">

I can also use JavaScript to store that number in a cookie called csscached so I’ll know if the user has a cached version of this revision of the stylesheet:

<script>
document.cookie = 'csscached={{ cssupdate }};expires="Tue, 19 Jan 2038 03:14:07 GMT";path=/';
</script>

The absence or presence of that cookie is going to be what determines whether the user gets inlined critical CSS (a first-time visitor, or a visitor with an out-of-date cached stylesheet) or whether the user gets a good ol’ fashioned external stylesheet (a repeat visitor with an up-to-date version of the stylesheet in their cache).

Here are the steps I’m going through:

First of all, set the Twig cssupdate variable to the last revision of the CSS:

{% set cssupdate = '20150310' %}

Next, check to see if there’s a cookie called csscached that matches the value of the latest revision. If there is, great! This is a repeat visitor with an up-to-date cache. Give ‘em the external stylesheet:

{% if _cookie.csscached == cssupdate %}
<link rel="stylesheet" href="/css/main.{{ cssupdate }}.css">

If not, then dump the critical CSS straight into the head of the document:

{% else %}
<style>
{% include '/css/critical.css' %}
</style>

Now I still want to load the full stylesheet but I don’t want it to be a blocking request. I can do this using JavaScript. Once again it’s Filament Group to the rescue with their loadCSS script:

 <script>
    // include loadCSS here...
    loadCSS('/css/main.{{ cssupdate }}.css');

While I’m at it, I store the value of cssupdate in the csscached cookie:

    document.cookie = 'csscached={{ cssupdate }};expires="Tue, 19 Jan 2038 03:14:07 GMT";path=/';
</script>

Finally, consider the possibility that JavaScript isn’t available and link to the full CSS file inside a noscript element:

<noscript>
<link rel="stylesheet" href="/css/main.{{ cssupdate }}.css">
</noscript>
{% endif %}

And we’re done. Phew!

Here’s how it looks all together in my Twig template:

{% set cssupdate = '20150310' %}
{% if _cookie.csscached == cssupdate %}
<link rel="stylesheet" href="/css/main.{{ cssupdate }}.css">
{% else %}
<style>
{% include '/css/critical.css' %}
</style>
<script>
// include loadCSS here...
loadCSS('/css/main.{{ cssupdate }}.css');
document.cookie = 'csscached={{ cssupdate }};expires="Tue, 19 Jan 2038 03:14:07 GMT";path=/';
</script>
<noscript>
<link rel="stylesheet" href="/css/main.{{ cssupdate }}.css">
</noscript>
{% endif %}

You can see the production code from The Session in this gist. I’ve tweaked the loadCSS script slightly to match my preferred JavaScript style but otherwise, it’s doing exactly what I’ve outlined here.

The result

According to Google’s PageSpeed Insights, I done good.

Hacky holidays on OS X

Christmas is a time for giving, a time for over-indulgence, a time for lounging around and for me, a time for doing those somewhat time-consuming tasks that I’d otherwise never get around to doing… like upgrading my operating system.

I used the downtime here in Arizona to install Leopard on my Macbook. I knew from reading other people’s reports that it might take some time to get my local web server back up and running. Sure enough, I had to jump through some hoops.

I threw caution to the wind and chose the “upgrade” option. Normally I’d choose “Archive and Install” but it sounds like this caused some problems for Roger .

The upgrade went smoothly. Before too long, I had a brand spanking new OS that was similar to the old OS but ever so slightly uglier and slower.

My first big disappointment was discovering that my copy of Photoshop 7 didn’t work at all. Yes, I know that’s a really old version but I don’t do too much image editing on my laptop so it’s always been good enough. I guess I should have done some reading up on compatibility before installing Leopard. Fortunately, I was able to upgrade from Photoshop 7 to Photoshop CS3—I was worried that I might have had to buy a new copy.

But, as I said, the bulk of my time was spent getting my local LAMP constellation back up and running. I did most of my editing in BBEdit—if you install the BBEdit command line tools, you can use the word bbedit in Terminal to edit documents. If you use Textmate, mate is the command you want.

Leopard ships with Apache 2 which manages virtual hosts differently to the previous version. Instead of keeping all the virtual host information in /etc/httpd/httpd.conf (or /etc/httpd/users/jeremy.conf), the new version of Apache stores it in /private/etc/apache2/extra/httpd-vhosts.conf. I fired up Terminal and typed:

bbedit /private/etc/apache2/extra/httpd-vhosts.conf

That file shows a VirtualHost example. After unlocking the file, I commented out the example and added my own info:

<VirtualHost *:80>
   ServerName adactio.dev
   DocumentRoot "/Users/jeremy/Sites/adactio/public_html"
</VirtualHost>

The default permissions are somewhat draconian so to avoid getting 403:Forbidden messages when trying to look at any local sites, I also added these lines to the httpd-vhosts.conf file:

<Directory /Users/*/Sites/>
    Options Indexes Includes FollowSymLinks SymLinksifOwnerMatch ExecCGI MultiViews
    AllowOverride All
    Order allow,deny
    Allow from all
</Directory>

I then saved the file, which required an admin password.

The good news is that Leopard doesn’t mess with the hosts file (located at /private/etc/hosts). That’s where I had listed the same host names I had chosen in the previous file:

127.0.0.1 localhost
127.0.0.1 adactio.dev

But for any of that to get applied, I needed to edit the httpd.conf file:

bbedit /private/etc/apache2/httpd.conf

I uncommented this line:

# Include /private/etc/apache2/extra/httpd-vhosts.conf

While I was in there, I also removed the octothorp from the start of this line:

# LoadModule php5_module libexec/apache2/libphp5.so

That gets PHP up and running. Leopard ships with PHP5 which is A Good Thing.

Going into Systems Preferences, then Sharing and then ticking the Web Sharing checkbox, I started up my web server and was able to successfully navigate to http://adactio.dev/. There I was greeted with an error message informing me that my local site wasn’t able to connect to MySQL.

Do not fear: MySQL is still there. But I needed to do two things:

Tell PHP where to look for the connection socket and
Get MySQL to start automatically on login.

For the first step, I needed a php.ini file to edit. I created this by copying the supplied php.ini.default file:

cd /private/etc
cp php.ini.default php.ini
bbedit php.ini

I found this line:

mysql.default_socket =

…and changed it to:

mysql.default_socket = /private/tmp/mysql.sock

I had previously installed MySQL by following these instructions but now the handy little preference pane for starting and stopping MySQL was no longer working. It was going to be a real PITA if I had to manually start up MySQL every time I restarted my computer so I looked for a way of getting it to start up automatically.

I found what I wanted on the TomatoCheese Blog. Here’s the important bit:

Remove the MySQL startup item (we’ll use the preferred launchd instead):
 sudo rm -R /Library/StartupItems/MYSQLCOM

Also, right-click and remove the MySQL preference pane in System Preferences because we’ll be using the preferred launchd instead.

Copy this MySQL launchd configuration file to /Library/LaunchDaemons, and change its owner to root:
sudo chown root /Library/LaunchDaemons/com.mysql.mysqld.plist

That did the trick for me. When I restarted my machine, MySQL started up automatically.

So after some command line cabalism and Google sleuthing, I had my local webdev environment back up and running on Leopard.