Google search engine results pages (SERPs) can provide a lot of important data for you and your business, but you most likely wouldn't want to scrape them manually. After all, there might be multiple queries you're interested in, and the corresponding results should be monitored on a regular basis. This is where automated scraping comes into play: you write a script that processes the results for you or use a dedicated tool to do all the heavy lifting.
In this article you'll learn how to scrape Google search results with Python. We will discuss three main approaches:
- Using the ScrapingBee API to simplify the process and overcome anti-bot hurdles (hassle-free)
- Using a graphical interface to construct a scraping request (that is, without any coding)
- Writing a custom script to do the job
We will see multiple code samples to help you get started as fast as possible.
Shall we get started?
You can find the source code for this tutorial on GitHub.
Why scrape search results?
The first question that might arise is "why in the world do I need to scrape anything?". That's a fair question, actually.
- You might be an SEO tool provider and need to track positions for billions of keywords.
- You might be a website owner and want to check your rankings for a list of keywords regularly.
- You might want to perform competitor analysis. The simplest thing to do is to understand how your website ranks versus that other guy's website: in other words, you'll want to assess your competitor's positions for various keywords.
- Also, it might be important to understand what customers are into these days. What are they searching for? What are the modern trends?
- If you're a content creator, it will be important for you to analyze potential topics to cover. What would your audience like to read about?
- You might also need to generate leads, monitor certain news or prices, or research and analyze a given field.
As you can see, there are many reasons to scrape the search results. But while we understand "why", the more important question is "how", which is closely tied to "what are the potential issues". Let's talk about that.
Challenges of scraping Google search results
Unfortunately, scraping Google search results is not as straightforward as one might think. Here are some typical issues you'll probably encounter:
Aren't you a robot, by chance?
I'm pretty sure I'm not a robot (mostly), but for some reason Google has been asking me this question for years now. It seems it's never satisfied with my answer. If you've seen those nasty "I'm not a robot" checkboxes, also known as CAPTCHAs, you know what I mean.
So-called "real humans" can pass these checks fairly easily, but for scraping scripts things become much harder. Yes, you can think of a way to solve CAPTCHAs, but this is definitely not a trivial task. Moreover, if you fail the check multiple times, your IP address might get blocked for a few hours, which is even worse. Luckily, there's a way to overcome this problem, as we'll see next.
Do you want some cookies?
If you open the Google Search home page in your browser's incognito mode, chances are you're going to see a "consent" page asking whether you are willing to accept some cookies (no milk though). Until you click one of the buttons, it won't be possible to perform any searches. As you can guess, the same thing might happen when running your scraping script. We will discuss this problem later in this article.
Don't request so much from me!
Another problem happens when you request too much data from Google, and it becomes really angry with you. It might happen when your script sends too many requests too fast, and consequently the service blocks you for a period of time. The simplest solution is to wait, or to use multiple IP addresses, or to limit the number of requests, or... perhaps there's some other way? We're going to find out soon enough!
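In the meantime, the "wait and limit the number of requests" idea can be sketched in a few lines. This is only a minimal illustration, not the approach used later in this article: `Blocked` and `fetch_with_backoff` are hypothetical names, and `fetch` stands for whatever function you use to download a page and detect that you've been blocked (for example, by checking for an HTTP 429 status).

```python
import random
import time


class Blocked(Exception):
    """Hypothetical signal that the server refused the request (e.g. HTTP 429)."""


def fetch_with_backoff(fetch, url, retries=4, base_delay=2.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fetch(url); if it raises Blocked, wait longer and longer before retrying."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Blocked:
            if attempt == retries:
                # Out of retries: let the caller decide what to do next.
                raise
            # Exponential backoff: 2s, 4s, 8s, ... capped at max_delay,
            # plus up to one second of random jitter.
            delay = min(base_delay * 2 ** attempt, max_delay)
            sleep(delay + random.uniform(0, 1))
```

You would call it as `fetch_with_backoff(my_fetch, url)`, where `my_fetch` is your own download function. The `sleep` parameter exists only so the waiting can be replaced in tests.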
Lost in data
Even if you manage to actually get some reasonable response from Google, don't celebrate yet. The problem is that the returned HTML contains lots and lots of stuff that you are not really interested in. There are all kinds of scripts, headers, footers, extra markup, and so on and so forth. Your job is to fetch the relevant information from all this gibberish, which can turn out to be a relatively complex task on its own.
To make things worse, Google tends to use not-so-meaningful tag IDs, so you can't even create reliable rules to find the content on the page. I mean, yesterday the necessary tag ID was yhKl7D (whatever that means) but today it's klO98bn. Go figure.
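One common way to cope with unstable IDs is to rely on the page structure instead. The sketch below uses only Python's standard library and assumes, purely for illustration, that each result title is an `<h3>` nested inside a link. The sample markup is made up and is not Google's actual HTML, and `ResultParser` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class ResultParser(HTMLParser):
    """Collect (title, href) pairs for every <h3> nested inside an <a>."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None    # href of the <a> we are currently inside, if any
        self._in_h3 = False  # True while we are between <h3> and </h3>
        self._title = []     # text fragments of the current title

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "h3" and self._href:
            self._in_h3 = True
            self._title = []

    def handle_data(self, data):
        if self._in_h3:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self.results.append(("".join(self._title).strip(), self._href))
            self._in_h3 = False
        elif tag == "a":
            self._href = None


# Made-up sample: the IDs differ, but the <a> -> <h3> structure is the same.
SAMPLE = """
<div id="yhKl7D"><a href="https://example.com/one"><h3>First result</h3></a></div>
<div id="klO98bn"><a href="https://example.com/two"><h3>Second <b>result</b></h3></a></div>
<a href="/settings">Settings</a>
"""

parser = ResultParser()
parser.feed(SAMPLE)
for title, href in parser.results:
    print(title, href)
```

The parser never looks at the `id` attributes, so a renamed ID does not break it. A change in the nesting itself still would, so treat any such rule as something to re-check regularly.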