Using .htaccess to Prevent Web Scraping
If you face the problem of others scraping content from one of your websites, there are many ways of detecting the scrapers; Google Webmaster Tools and Feedburner are two of the tools that can help.
In this article, we will discuss a few ways to make the lives of these scrapers difficult, using .htaccess files in Apache.
An .htaccess (hypertext access) file is a plain-text configuration file for web servers that overrides the global server settings for the directory in which it is placed. Used creatively, it can help prevent web scraping.
Before we discuss the specific methods, let me clear up one small fact: If something is publicly available, it can be scraped. The steps that we discuss here can only make things more difficult, not impossible. However, what would you do if someone is smart enough to bypass all your filters? We have a solution for that too.
Getting Started with .htaccess
Since the use of .htaccess files involves Apache checking and reading every .htaccess file on each request, it is generally turned off by default. There are different procedures to enable it on Ubuntu, OS X, and Windows. Your .htaccess files will be interpreted by Apache only after you enable them; otherwise, they will simply be ignored.

Next, in most of our use cases, we will be using Apache's RewriteEngine, which is part of the mod_rewrite module. If necessary, you could check out a detailed guide on how to set up mod_rewrite for Apache, or a general guide on .htaccess. Once you have completed these steps, you are ready to proceed with the solutions discussed here for dealing with content scrapers. If you haven't completed either of these steps successfully, Apache will ignore your .htaccess files or raise an error when you restart it after making changes.
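For reference, here is a minimal sketch of what enabling .htaccess overrides can look like in the main Apache configuration (httpd.conf or apache2.conf); the document root path used here is only an assumption, so adjust it to your own setup:

```
# In the main Apache configuration, not in an .htaccess file.
# /var/www/html is an assumed document root -- replace it with yours.
<Directory /var/www/html>
    # Let .htaccess files in this tree override server settings
    AllowOverride All
</Directory>

# mod_rewrite must also be enabled; on Debian/Ubuntu this is typically
# done with "a2enmod rewrite" followed by an Apache restart.
```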
Prevent Hotlinking
If someone scrapes your content, all your inline HTML remains the same. This means that the links to the images that were part of your content (and most probably hosted on your domain) remain the same. If the scraper wishes to put the content on a different website, the image would still link back to the original source. This is called hotlinking. Hotlinking costs you bandwidth, because every time someone opens the scraper's site, your image is downloaded.

You can prevent hotlinking by adding the following lines to your .htaccess file:
```
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$

# domains that can link to your content (images here)
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?mysite.com [NC]

# show no image when hotlinked
RewriteRule \.(jpg|png|gif)$ - [NC,F,L]

# Or show an alternate image
# RewriteRule \.(jpg|png|gif)$ http://mysite.com/forbidden_image.jpg [NC,R,L]
```
- Switching on RewriteEngine gives us the ability to redirect the user's request. RewriteCond specifies which requests should be redirected. %{HTTP_REFERER} is the variable that contains the domain from which the request was made.
- Then we match it with our own domain, mysite.com. We add (www\.)? to ensure requests from both mysite.com and www.mysite.com are allowed. Similarly, our code covers http and https.
- Next, we check if a jpg, png, or gif file was requested, and either show an error or redirect the request to an alternate image. NC ignores the case, F shows a 403 Forbidden error, R redirects the request, and L stops rewriting.
- Note that you should apply only one of the two rules above (either the 403 error or the alternate image). As soon as L is encountered, Apache will not apply any further rules; in the code example above, the alternate image method is commented out.
How Can Web Scrapers Bypass This?
One way for a web scraper to bypass such a hurdle is to download images as it encounters them in the HTML code. In that case, a regular expression check can be applied, the images downloaded, and the image links rewritten before the data is stored.

Allow or Block Requests From Specific IP Addresses
If you happen to determine the origin of the web scraper's requests (usually given away by an unnaturally high number of requests from the same IP address), you can block requests from that IP address.
```
# Apache 2.2 syntax (provided by mod_access_compat on Apache 2.4)
Order Deny,Allow
Deny from xxx.xxx.xxx.xxx
```
Replace xxx.xxx.xxx.xxx with the IP address you want to block. If you are really paranoid about security, you could deny requests from all IP addresses and selectively allow from a whitelist:
```
Order Deny,Allow
Deny from all
# IP address whitelist
Allow from xx.xxx.xx.xx
Allow from xx.xxx.xx.xx
```
This whitelisting approach is also handy for protecting sensitive areas of a site, such as the WordPress wp-admin directory. In such a case, you would allow requests from only your own IP address, eliminating the possibility of someone hacking your site via wp-admin.
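For instance, a minimal sketch of an .htaccess file placed inside the wp-admin directory could look like the following; yy.yy.yy.yy is a placeholder for your own IP address:

```
# .htaccess inside wp-admin
# yy.yy.yy.yy is a placeholder for your own IP address
Order Deny,Allow
Deny from all
Allow from yy.yy.yy.yy
```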
How Can Web Scrapers Bypass This?
If a web scraper has access to proxies, it can distribute its requests over a list of IP addresses to avoid abnormal activity from any single one. To explain: let's say someone is scraping your site from IP address 1.1.1.1, so you block 1.1.1.1 using .htaccess. Now, if the scraper has access to a proxy server 2.2.2.2, it routes its requests through 2.2.2.2, so it appears to your server that the requests are coming from 2.2.2.2. In spite of blocking 1.1.1.1, the scraper is still able to access the resource.
Thus, if the scraper has access to thousands of these proxies, it can become undetectable if it sends requests in low numbers from each proxy.
Redirect Requests From an IP Address
Not only can you block an IP address, you can also redirect its requests to a different page:
```
RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
RewriteRule .* http://mysite.com [R,L]
```
Web scraping is a systematic procedure. It involves studying URL patterns and sending requests to all possible pages on a website. If you are a WordPress user, for instance, the URL pattern is http://mysite.com/?p=[page_no], where you increment page_no from 1 to a large number.

What you could do is create a page especially for redirection, which redirects the request to one of a number of predefined pages:
```
RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
RewriteRule .* http://mysite.com/redirection_page [R,L]
```
Alternatively, redirection_page can redirect to a third page, redirection_page_1, which in turn redirects back to redirection_page. This leads to a redirect loop, and a request gets bounced back and forth between the two pages indefinitely.
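As a rough sketch, reusing the placeholder IP range and the example page names from above, such a loop could be wired up with mod_rewrite like this:

```
# Bounce requests from the offending IP range between the two pages.
# Each RewriteCond applies only to the RewriteRule that follows it.
RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
RewriteRule ^redirection_page$ http://mysite.com/redirection_page_1 [R,L]

RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
RewriteRule ^redirection_page_1$ http://mysite.com/redirection_page [R,L]
```

In practice, most clients give up after a fixed number of redirects, but either way the request never reaches any real content.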
How Can Web Scrapers Bypass This?
A web scraper could check whether the request was redirected. If there is a redirect, it would get a 301 or 302 HTTP status code; if there was no redirection, it would get the normal 200 status code.

Matt Cutts to the Rescue
Matt Cutts is the head of the web spam team at Google. Part of his job is to be on constant lookout for scraping sites. If he doesn't like your website, he can make it vanish from Google's search results. The recent Panda and Penguin updates to Google's search algorithm have affected a huge number of sites, including a number of scraper sites.

A webmaster can report scraper sites to Google using this form, providing the source of the content. If you produce original content, you would definitely be on the radar of web scrapers. Yet, if they re-publish your content, Google will make sure that they are omitted from its search results.