Bots.lol

How To Scrape Google Search Results Data In Python Easily

Thu, 25 Jan 2024 14:38:35 GMT

Google search engine results pages (SERPs) can provide a lot of important data for you and your business, but you most likely wouldn't want to scrape them manually. After all, there might be multiple queries you're interested in, and the corresponding results should be monitored on a regular basis. This is where automated scraping comes into play: you write a script that processes the results for you or use a dedicated tool to do all the heavy lifting.

In this article you'll learn how to scrape Google search results with Python. We will discuss three main approaches:

Using the ScrapingBee API to simplify the process and overcome anti-bot hurdles (hassle-free)
Using a graphical interface to construct a scraping request (that is, without any coding)
Writing a custom script to do the job

We will see multiple code samples to help you get started as fast as possible.

Shall we get started?

You can find the source code for this tutorial on GitHub.

Why scrape search results?

The first question that might arise is "why in the world do I need to scrape anything?". That's a fair question, actually.

You might be an SEO tool provider and need to track positions for billions of keywords.
You might be a website owner and want to check your rankings for a list of keywords regularly.
You might want to perform competitor analysis. The simplest thing to do is to understand how your website ranks versus that other guy's website: in other words, you'll want to assess your competitor's positions for various keywords.
Also, it might be important to understand what customers are into these days. What are they searching for? What are the modern trends?
If you're a content creator, it will be important for you to analyze potential topics to cover. What would your audience like to read about?
Perhaps you need to perform lead generation, monitor certain news or prices, or research and analyze a given field.

As you can see, there are many reasons to scrape the search results. But while we understand "why", the more important question is "how", which is closely tied to "what are the potential issues". Let's talk about that.

Challenges of scraping Google search results

Unfortunately, scraping Google search results is not as straightforward as one might think. Here are some typical issues you'll probably encounter:

Aren't you a robot, by chance?

I'm pretty sure I'm not a robot (mostly), but for some reason Google has kept asking me this question for years now. It seems it's never satisfied with my answer. If you've seen those nasty "I'm not a robot" checkboxes, also known as captchas, you know what I mean.

So-called "real humans" can pass these checks fairly easily, but for scraping scripts things become much harder. Yes, you can think of a way to solve captchas, but this is definitely not a trivial task. Moreover, if you fail the check multiple times, your IP address might get blocked for a few hours, which is even worse. Luckily, there's a way to overcome this problem, as we'll see next.

Do you want some cookies?

If you open the Google search home page in your browser's incognito mode, chances are you're going to see a "consent" page asking whether you are willing to accept some cookies (no milk though). Until you click one of the buttons, it won't be possible to perform any searches. As you can guess, the same thing might happen when running your scraping script. We will discuss this problem later in this article.

Don't request so much from me!

Another problem happens when you request too much data from Google, and it becomes really angry with you. It might happen when your script sends too many requests too fast, and consequently the service blocks you for a period of time. The simplest solution is to wait, or to use multiple IP addresses, or to limit the number of requests, or... perhaps there's some other way? We're going to find out soon enough!
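One of the options mentioned above, waiting and limiting the number of requests, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the default delays, and the `fetch` / `is_blocked` callables are hypothetical placeholders, not part of any particular scraping library.

```python
import random
import time


def backoff_delays(base=2.0, factor=2.0, max_delay=60.0, attempts=5,
                   jitter=0.0, rng=None):
    """Return the wait times (in seconds) to use between retries.

    All defaults are illustrative assumptions, not recommended values.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Grow the delay exponentially, but never beyond max_delay.
        delay = min(base * factor ** attempt, max_delay)
        # Optional jitter so requests do not arrive on a fixed beat.
        delay += rng.uniform(0, jitter)
        delays.append(delay)
    return delays


def fetch_with_backoff(fetch, is_blocked, sleep=time.sleep, **kwargs):
    """Call fetch() again, after a growing pause, while is_blocked(response).

    `fetch` and `is_blocked` are caller-supplied callables (hypothetical);
    the last response is returned even if it is still blocked.
    """
    response = fetch()
    for delay in backoff_delays(**kwargs):
        if not is_blocked(response):
            return response
        sleep(delay)
        response = fetch()
    return response
```

Passing `sleep` as a parameter keeps the sketch easy to try out without actually waiting; in a real script you would leave the default `time.sleep` in place.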

Lost in data

Even if you manage to get a reasonable response from Google, don't celebrate yet. The returned HTML contains lots and lots of stuff that you are not really interested in: all kinds of scripts, headers, footers, extra markup, and so on. Your job is to fetch the relevant information from all this gibberish, which can be a relatively complex task on its own.

To make matters worse, Google tends to use not-so-meaningful tag IDs, so you can't even create reliable rules to search the content on the page. I mean, yesterday the necessary tag ID was yhKl7D (whatever that means) but today it's klO98bn. Go figure.
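One way to cope with changing IDs is to match on the structure of the markup rather than on ID values. The sketch below uses only Python's standard `html.parser` module and assumes, purely for illustration, that each result is a link (`<a href>`) wrapping an `<h3>` title. That layout is an assumption and may not match what Google actually returns on a given day; the class and function names are hypothetical as well.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect (title, url) pairs for every <h3> nested inside an <a href>."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None    # href of the <a> we are currently inside, if any
        self._in_h3 = False  # True while inside an <h3> within that <a>
        self._title = []     # text fragments of the current title

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "h3" and self._href:
            self._in_h3 = True
            self._title = []

    def handle_data(self, data):
        if self._in_h3:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self._in_h3 = False
            self.results.append(("".join(self._title).strip(), self._href))
        elif tag == "a":
            self._href = None


def extract_results(html):
    """Return a list of (title, url) tuples found in the given HTML string."""
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.results


# Made-up sample markup; the id value is ignored on purpose.
sample = (
    '<div id="yhKl7D"><a href="https://example.com"><h3>Example</h3></a></div>'
    '<a href="/about">About</a>'
)
print(extract_results(sample))
```

Because the rule never looks at the `id` attribute, it keeps working if `yhKl7D` turns into `klO98bn`, as long as the assumed link-and-heading structure stays the same.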

Beating Google ReCaptcha and the funCaptcha using AWS Rekognition

Tue, 25 Aug 2020 17:11:10 GMT

Project Voight-Kampff

Originally found HERE.

Beating Google's reCaptcha using AWS Rekognition. Part of project Touch-Captcha (두 터치). I did this because I cannot promote a better Captcha without first beating the industry standard.

Nothing special here. Credit goes to the ML researchers who developed the image classification technologies readily available today, either via the Google Vision API or AWS Rekognition.

Voight-Kampff comes from the movie Blade Runner (1982). It is the test used by Blade Runners to tell a Replicant (synthetic human/android) from a human being.

I am doing this because: I like research and I want to get a PhD in Machine Thinking (the inverse of Machine Learning). Your contributions will help me focus solely on this work with minimal distractions from the outside world.

You will need:

Google GCP account (The virtual machines I use are hosted in GCP)
AWS account (Proxies)

Pull with

curl "https://raw.githubusercontent.com/pirates-of-silicon-hills/test/master/setup.sh" --output setup.sh

chmod u+r+x setup.sh

./setup.sh

Past Puzzles

Every puzzle you see in my demonstration videos has been saved as an image with a unique name as identifier. You can download all the images here: https://drive.google.com/open?id=18b0HxyOsLP6AZMpF1-DNITrGvFBGkYND

Good IP vs Bad IP?

Fri, 21 Aug 2020 16:00:00 GMT

In the past I've mentioned "Good IPs" and "Bad IPs". So what makes an IP bad? Well, it comes down to what other people are doing on that IP. If you're using a cheap/free/crappy VPN or proxy, chances are you're sharing it with bad actors.

Really Bad

This IP shows up on a number of blacklists. You're not able to perform a Google search without getting a captcha from Google. Your "United States" IP address actually returns a non-US country on an IP whois. Some sites will block this directly and prevent any pages from loading.

Throw it away. Never use that IP again.

Bad

You're not blacklisted, but the location services are wonky. Sometimes an IP whois will return US results, but Google and various websites will place you somewhere else. One easy way to check is to see whether you get ads in a Google search. Are your Google search results in Vietnamese or some other language you wouldn't expect?

Results for a "US" based IP.

I don't think we're actually in Kansas.

Good

Your IP doesn't have bad traffic and it returns US results. Probably a good IP. Now, if it's a 'datacenter' IP you might still get extra attention. Most VPNs and cheap proxies use datacenters as they're cheap.

Ads! Yay.

When Good IPs Go Bad

Now, with a VPN, IPs rotate and you may not get the same one ever again, which is kind of the point. If you use a proxy that leases an IP to multiple people, you could end up sharing it with some bad actors. Your good IP may not be good after a few days. It could deteriorate and eventually fall on a blacklist.

A note on datacenters. Datacenters are not bad, but they're also not good. Some people will straight up ban datacenter access or flag you. Other places do not care a ton and treat IPs as good until they've done something bad. This can be used to your advantage, as some datacenters don't allow bad actors. A cheap VPS that you spin up and route traffic to could provide enough cover for your tasks, especially because you know all the traffic coming from that IP. Yes, someone could have done something 'bad' and released it to the pool, but that is less likely than with a free/cheap proxy. Additionally, some studies have shown that 50% of bots come from datacenter traffic and the rest come from residential or organizational IPs.

One Final Word

A 'Good IP' on one site may be a 'Bad IP' to another site. You could share an IP with a bad actor who's scraping Amazon prices. No one else will care about that IP but Amazon. Much of this depends on the organization size. Amazon generally does most of its stuff in house, so if {bad actor} has been flagged on Amazon, your actions on Amazon could be flagged as well. Smaller sites, however, generally share tools. If ExampleA.com and ExampleB.com use Anti-Scraping.com's services, {bad actor} gets flagged trying to scrape ExampleA.com, and you wander over to ExampleB.com, you'll get blocked as well. But that IP may be fine elsewhere.

Let's Talk Behavior Analysis

Fri, 14 Aug 2020 16:00:00 GMT

These days people are using more behavior analysis and other buzzwords to analyze how people interact with their site. While probably not the first, HotJar is a common and well-known tool for tracking the behaviors of users. Common tools include heatmaps that track users' mouse movements as well as conversion funnels to see the common flow of users before they either convert or drop off.

What's this have to do with automation? Many companies are also using behavior analysis to find cheaters, bots, or other automation tools. For instance, in video games, if your anti-cheat software is beaten and a user is running a speedhack, you can still record x,y,z locations at regular intervals and find players who have exceeded a reasonable speed or are outside of a reasonable boundary. You can isolate those that stand out from the norm and ban them. Additionally, this can also be used to detect bots that follow a very strict path. More information here.
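The speed check described above is easy to sketch. The following Python is a minimal illustration, not how any particular game does it: the `(t, x, y, z)` sample format, the function name, and the `max_speed` threshold are all assumptions made for the example.

```python
import math


def flag_speed_violations(samples, max_speed):
    """Return indices of samples that were reached faster than max_speed.

    samples: list of (t, x, y, z) tuples sorted by time t in seconds
             (a hypothetical log format).
    max_speed: highest plausible speed in distance units per second.
    """
    flagged = []
    for i in range(1, len(samples)):
        t0, *p0 = samples[i - 1]
        t1, *p1 = samples[i]
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        # Straight-line distance between consecutive positions over elapsed time.
        speed = math.dist(p0, p1) / dt
        if speed > max_speed:
            flagged.append(i)
    return flagged


# Made-up position log: the last jump covers 100 units in one second.
log = [(0, 0, 0, 0), (1, 3, 4, 0), (2, 3, 4, 0), (3, 103, 4, 0)]
print(flag_speed_violations(log, max_speed=7))
```

A boundary check works the same way: instead of comparing speed against a threshold, compare each position against the allowed region.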

Easy to follow, easy to track.

So what do you do? Well, a mesh system is harder to detect than a strict point-to-point system. Many users farming an area will stick to one specific path. If you map out the whole area and place key points, it becomes harder to detect. Rather than going from point A to B to C to D... you can get from point A to D via three different paths. And then maybe you go back to B this time instead of C. It adds some randomness. But there are still ways to catch it, as maybe you always stop and change direction at coord x,y,z. Most of these mesh systems still use key coord points rather than adding any randomness, so at the end of the day you're still doing something very repetitive.
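To make the idea of a mesh concrete, here is a small Python sketch that lists every route between two waypoints and picks one at random. The waypoint names and the `MESH` layout are made up for illustration, and edges are listed in one direction only to keep the example short.

```python
import random


def simple_paths(graph, start, goal, path=None):
    """Yield every path from start to goal that visits no waypoint twice."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for neighbor in graph.get(start, []):
        if neighbor not in path:
            yield from simple_paths(graph, neighbor, goal, path)


# Hypothetical waypoint mesh: each key lists the waypoints reachable from it.
MESH = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}


def pick_route(graph, start, goal, rng=random):
    """Choose one of the possible routes at random."""
    return rng.choice(list(simple_paths(graph, start, goal)))


for route in simple_paths(MESH, "A", "D"):
    print(" -> ".join(route))
```

With this mesh there are three routes from A to D. Note that every route still passes through the same fixed waypoints, which is exactly the repetitive pattern the paragraph above warns about.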

An example of a mesh system. Lots of paths!

But these are just some of the tools available to catch botters. MMOs will deploy other things, such as checking who's using a fake keyboard, to catch users who write AutoIt or AutoHotKey scripts. This has led to false positives for people who use ADA software, so they have likely adjusted detection for when that kind of process is running. I feel this is also how they're catching some of the more recent fishbots, as they use a simulated keyboard. Once again, I haven't botted in a while, but I used a modified version of a popular fishbot that faked a hardware keyboard. I avoided detection.

Behavior Analysis On The Web

Alright, who cares about video games, right? I use Selenium and wanna bot the web! There are many tools out there that do the same thing. And luckily they are generally open and tell you what they do and how they detect you. Let's talk about Sift for a minute. "Sift prevents fraud with industry-leading technology and expertise, an unrivaled global data network, and a commitment to building long-term partnerships with our customers." They're a group of ex-Googlers who have developed some tracking software to find bots and other bad actors. They try to defeat fraud, bot-created accounts, and a whole bunch of other things. Let's take a look at what they do.

The code to do their goodies.