Web Scraping, the challenges

Web scraping is a technique for gathering information and data from a website. Even though copying something manually from a website counts as web scraping, the term usually refers to the process of extracting data using a bot or web crawler. For example, say you want to download the lyrics of all the songs by The Beatles. Instead of manually searching for every song on the web and copying the lyrics, you could write a scraping program to download the lyrics for you.

Even though the benefits of web scraping are numerous, it comes with challenges, both technical and non-technical. In this blog, I will try to highlight some of these challenges along with the lessons I learnt while designing and developing web-scraping applications. I will also try to keep the code blocks to a minimum so that it is easier for everyone to understand.

Variety

No two websites are the same. This makes it very difficult to build a generic solution; every website needs to be treated uniquely.

Technically, if you are scraping a website using BeautifulSoup or JSoup, you are required to provide HTML/CSS tags like the line below.

soup.find('div', {'class': 'song1-lyrics-details'}).find('p').text  ## 'soup' is the parsed page
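
To give a fuller picture, here is a minimal sketch of how that lookup fits into a complete scrape. The URL and the 'song1-lyrics-details' class name are hypothetical placeholders, not a real lyrics site:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/lyrics/yesterday'  ## hypothetical lyrics page
response = requests.get(url)

## parse the raw HTML so it can be searched by tag and class name
soup = BeautifulSoup(response.text, 'html.parser')
lyrics = soup.find('div', {'class': 'song1-lyrics-details'}).find('p').text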

The CSS/HTML class name will not be the same all throughout the website. Every developer has their own style of writing the same piece of code.

Durability

Websites change constantly. In a fast-moving world, it is more important than ever to continuously change and improve the user experience of a website. This brings us to the second challenge of web scraping: durability. Even if you have worked hard to write a scraping program that works on a website today, there is a huge chance it will not work the next time you run it. You could end up with a big block of errors about missing tags or HTML/CSS classes.

Lesson learnt: while working on a web-scraping project, one of my personal hard-learnt lessons was to always add error handling around each and every tag extraction. At the very least, this ensures the program doesn't fail abruptly in the middle.
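
To make this concrete, here is a hedged sketch of per-tag error handling, reusing the hypothetical class name from above. find() returns None when a tag or class is missing, so each step is guarded instead of letting the chained calls crash:

from bs4 import BeautifulSoup

def extract_lyrics(soup):
    block = soup.find('div', {'class': 'song1-lyrics-details'})
    if block is None:  ## the layout changed or the class was renamed
        return None
    paragraph = block.find('p')
    return paragraph.text if paragraph is not None else None

html = '<div class="new-layout"><p>...</p></div>'  ## a page whose markup changed
print(extract_lyrics(BeautifulSoup(html, 'html.parser')))  ## prints None instead of crashing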

Legal Rights

It's always good not to violate any terms and conditions or rights, whether in web scraping or any other solution. Since web scraping is ultimately all about gathering information, it's always advisable to check a site's terms and conditions first. To read more on this, you could check out the Legal Perspective of Web Scraping from the Modern Web article.

Overwhelming

One should take into consideration the volume of requests sent to a website while scraping. It's pretty easy to overwhelm a web application with requests. Consider writing something like the line below:

import requests

response = requests.get(url)     ## url = web application link

This essentially means you are asking the server to send you a response. Now imagine a scenario where you need to request thousands of child links of the same website and scrape every single one of them. The website's servers could end up overwhelmed by the large number of requests. If the architecture of the website is not scalable enough, sending thousands of requests in a short time will result in timeout issues and can eventually crash the website's server.

Lesson learnt: try to limit the number of requests to the web page per second. If you are using a loop, sleep after a certain number of request calls; the threshold I usually follow is 2 minutes. A sample would be something like the code below:

import time

import requests

for total_request_count, url in enumerate(child_links, start=1):  ## child_links = list of page URLs
    if total_request_count in [100, 200, 300]:
        time.sleep(120)  ## sleep for 120 sec if total requests reach 100, 200 or 300
    response = requests.get(url)  ## request the web page
    ...  ## scrape the response here

Dynamic Websites
