Data scraping is common these days, with so many data-driven web applications out there. Regardless of the legality or ethics of the practice, it stings to know that someone may be hammering your site to “take” all your hard work for their own use. That said, it’s almost impossible to stop people from taking your data entirely, especially when it’s freely accessible to the public.
You’re probably here because you suspect data scraping is happening on your website. While there are preventive measures you can take to keep scrapers out, a determined one will always find a hole and eventually get the data they want. The goal, then, is to make it very difficult for the average scraper to get at your data.
Make Your Data “Unselectable”
Making your text unselectable deters casual copy-and-paste, but it’s no silver bullet: if the scraper is willing to manually retype your web content, there’s nothing you can do.
Add Fake Data
If your data is neatly organized within tables or blocks, and marked up with CSS classes and IDs, it is very easy for a script writer to scrape. You can do the following to make their task harder:
- Add random, useless hidden tables/blocks of data: reuse the same structures you use to display your real data, and hide them from regular users with CSS. Visitors will never see them, but the markup is still there for any scraper to pick up.
- Add gibberish to the class and id attributes of your HTML tags: most people write markup to be readable, e.g. <div class="car">Acura</div>. Make it harder to interpret by using gibberish class names and shuffling blocks of code around without breaking them.
- Add fake links to fake data: some scrapers go as far as following the link associated with each record to grab further details. Feed them bogus links.
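The first two ideas above can be combined server-side. Here is a minimal sketch in Python (all names hypothetical) that interleaves decoy rows with real ones, hides the decoys with inline CSS, and gives every row a random, meaningless class name so scrapers cannot target a stable selector:

```python
import random
import string

def gibberish(n=8):
    """Random class name so scrapers can't rely on a stable selector."""
    return "".join(random.choices(string.ascii_lowercase, k=n))

def render_rows(real_rows, fake_rows):
    """Interleave real and decoy rows. Decoys are hidden with inline CSS,
    so regular users never see them, but anyone parsing the markup does."""
    rows = [(r, False) for r in real_rows] + [(f, True) for f in fake_rows]
    random.shuffle(rows)
    html = []
    for text, is_fake in rows:
        style = ' style="display:none"' if is_fake else ""
        html.append(f'<div class="{gibberish()}"{style}>{text}</div>')
    return "\n".join(html)
```

A scraper that grabs every div indiscriminately now collects the fakes too, and cannot key off class names to filter them out.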
Obfuscate Your Ajax Responses
Using Ajax has its benefits, but keeping data scrapers away isn’t one of them. In fact, Ajax is arguably the easiest channel for scrapers to mine, depending on how the responses are formatted. Many sites use Ajax requests to retrieve data as JSON or some other array-like structure. That is by far the easiest thing for the average programmer to parse, because the data arrives already structured the way they want it. Make it harder by having your Ajax endpoints return a block of HTML ready for display, or at least randomize the structure of the data so that it can’t be trivially interpreted.
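To illustrate the difference, here is a hedged Python sketch (endpoint and data names are made up) contrasting a JSON response, which hands the scraper ready-made records, with a display-ready HTML fragment that forces them to parse markup instead:

```python
import json
from html import escape

def ajax_json(rows):
    """Easy prey: structured JSON is trivial for a scraper to parse."""
    return json.dumps(rows)

def ajax_html(rows):
    """Harder: return a display-ready HTML fragment, so the scraper must
    parse markup rather than consume ready-made records."""
    return "".join(
        f'<li><span>{escape(r["make"])}</span> {escape(r["model"])}</li>'
        for r in rows
    )

cars = [{"make": "Acura", "model": "TL"}, {"make": "Honda", "model": "Civic"}]
```

The HTML variant costs you nothing if the page was going to render it anyway, and pairs well with the gibberish class names described earlier.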
Make a Dummy Page
Most of the time, a site has too much data to list on one page, which makes pagination a useful feature. It also forces scrapers to fetch pages by number until they hit one that doesn’t return the content they expect. Be clever: use a convention where the URL format changes after a certain page, or add dummy pages that only scrapers can reach, fooling them into thinking there are more pages than there actually are.
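One way to sketch the dummy-page idea in Python (page counts and row contents are invented for illustration): instead of returning a 404 past the last real page, keep serving plausible-looking decoys, so a page-incrementing scraper never learns where the real data ended.

```python
REAL_PAGE_COUNT = 3  # assumption: the site really has 3 pages of data

def real_rows(page):
    return [f"real item {page}-{i}" for i in range(2)]

def decoy_rows(page):
    # Plausible-looking fakes, same shape as the real rows.
    return [f"item {page}-{i}" for i in range(2)]

def get_page(page):
    """Serve real data for genuine pages; past the end, serve decoys
    instead of an error, so the scraper can't detect the boundary."""
    if page <= REAL_PAGE_COUNT:
        return {"trap": False, "rows": real_rows(page)}
    return {"trap": True, "rows": decoy_rows(page)}
```

A human paging through the UI never requests page 4, so any client that does has identified itself as a script.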
Ban their IP Address
Using some of the techniques above, you can catch scrapers by luring them to a page that no regular user would ever reach through a browser. Once they land on it, log their IP address and ban it from your site. There’s a chance they’ll come back from a different IP address, but at least you’ve made their job one step harder.
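The trap-then-ban flow might look like the following Python sketch (the trap path and request-handling interface are hypothetical, standing in for whatever framework you use): a request to a honeypot URL, reachable only by parsing hidden links, bans that IP for all future requests.

```python
# Paths linked only from hidden markup; no human ever clicks these.
TRAP_PATHS = {"/listings/page/9999"}
banned_ips = set()

def handle_request(path, ip):
    """Return (status, body). Hitting a trap path logs and bans the IP;
    banned IPs are refused on every subsequent request."""
    if ip in banned_ips:
        return 403, "Forbidden"
    if path in TRAP_PATHS:
        banned_ips.add(ip)  # log the offender and ban them
        return 403, "Forbidden"
    return 200, "OK"
```

In practice you would persist the ban list and enforce it at the web server or firewall rather than in application code, but the detection logic is the same.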