Journal tags: crawling

1

Permission

Back when the web was young, it wasn’t yet clear what the rules were. Like, could you really just link to something without asking permission?

Then came some legal rulings to establish that, yes, on the web you can just link to anything without checking if it’s okay first.

What about search engines and directories? Technically they’re rifling through all the stuff we publish and reposting snippets of it. Is that okay?

Again, through some legal precedents—but mostly common agreement—everyone decided that on balance it was fine. After all, those snippets they publish are helping your site get traffic.

In short order, search came to rule the web. And Google came to rule search.

The mutually beneficial arrangement persisted uneasily. Despite Google’s search results pages getting worse and worse in recent years, the company’s huge market share of search means you generally want to be in their good books.

Google’s business model relies on us publishing web pages so that they can put ads around the search results linking to that content, and we rely on Google to send people to our websites by responding smartly to search queries.

That has now changed. Instead of responding to search queries by linking to the web pages we’ve made, Google is instead generating dodgy summaries rife with hallucina… lies (a psychic hotline, basically).

Google still benefits from us publishing web pages. We no longer benefit from Google slurping up those web pages.

With AI, tech has broken the web’s social contract:

Google has steadily been manoeuvring their search engine results to more and more replace the pages in the results.

As Chris puts it:

Me, I just think it’s fuckin’ rude.

Google is a portal to the web. Google is an amazing tool for finding relevant websites to go to. That was useful when it was made, and it’s nothing but grown in usefulness. Google should be encouraging and fighting for the open web. But now they’re like, actually we’re just going to suck up your website, put it in a blender with all other websites, and spit out word smoothies for people instead of sending them to your website. Instead.

Ben proposes an update to robots.txt that would allow us to specify licensing information:

Robots.txt needs an update for the 2020s. Instead of just saying what content can be indexed, it should also grant rights.

Like crawl my site only to provide search results not train your LLM.

It’s a solid proposal. But Google has absolutely no incentive to implement it. They hold all the power.

Or do they?

There is still the nuclear option in robots.txt:

User-agent: Googlebot
Disallow: /

That’s what Vasilis is doing:

I have been looking for ways to not allow companies to use my stuff without asking, and so far I coulnd’t find any. But since this policy change I realised that there is a simple one: block google’s bots from visiting your website.

The general consensus is that this is nuts. “If you don’t appear in Google’s results, you might as well not be on the web!” is the common cry.

I’m not so sure. At least when it comes to personal websites, search isn’t how people get to your site. They get to your site from RSS, newsletters, links shared on social media or on Slack.

And isn’t it an uncomfortable feeling to think that there’s a third party service that you absolutely must appease? It’s the same kind of justification used by people who are still on Twitter even though it’s now a right-wing transphobic cesspit. “If I’m not on Twitter, I might as well not be on the web!”

The situation with Google reminds me of what Robin said about Twitter:

The speed with which Twitter recedes in your mind will shock you. Like a demon from a folktale, the kind that only gains power when you invite it into your home, the platform melts like mist when that invitation is rescinded.

We can rescind our invitation to Google.