Web Scraping

Web scraping is not just about writing regular expressions; it is more about understanding how a website is laid out! It is basically a technique for extracting large amounts of data from websites and storing it in a local file or a database.

We are using Python for this, with two packages: requests and Beautiful Soup. Beautiful Soup is amazing at parsing, but it cannot make a request to a website, which is why we need the requests package. We will be using Python 3, although web scraping works in Python 2 as well! If you are using Python 3, the following command may not do what you expect:

pip install bs4

This installs bs4 into the base Python that your terminal already has. Since in many cases the base Python installed is 2.7, bs4 ends up in the Python 2 installation. To make sure it gets installed for the Python 3 on your system, use the following command (on Debian or Ubuntu):

sudo apt-get install python3-bs4
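
If you are not sure which Python your packages ended up in, a quick check from inside Python 3 can help. The following is only a minimal sketch using the standard library; missing_packages is a hypothetical helper name made up for this post, not something bs4 or requests provides.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the package names that the running interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Show which interpreter is running, so you know where packages must live.
print("Interpreter:", sys.executable)
print("Python version:", sys.version.split()[0])

# An empty list here means both packages are importable from this interpreter.
print("Missing:", missing_packages(["requests", "bs4"]))
```

Run it with python3. If the "Missing" list is empty, both packages are available to that interpreter.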

Once bs4 and requests have been installed, you have to import them!

import requests

import bs4

Both statements import the required libraries.

res = requests.get('website link')

The variable res stores the result of requests.get. When we execute this command, res holds the response for the entire web page, which includes the full HTML document of the page. If you want to see the HTML stored in res, type in:

res.text
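
Before dumping the whole page, it can be handy to confirm the request succeeded and look at just a small slice of the HTML. Here is a minimal sketch of that idea. The helper names fetch_html and preview, the 10-second timeout and the 200-character limit are all assumptions made for illustration, and the URL in the usage comment is only a placeholder.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # raise an exception on 4xx/5xx responses
    return res.text


def preview(text, limit=200):
    """Return at most `limit` characters of text, marking any truncation."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


# Example usage (needs network access; the URL is a placeholder):
# html = fetch_html("https://example.com")
# print(preview(html))
```
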

But the output of res.text is quite overwhelming! The entire console fills up and there is not much one can make sense of. This is where BeautifulSoup comes into the picture.

soup = bs4.BeautifulSoup(res.text, 'lxml')

This command parses the content of the response with the lxml parser (the lxml package needs to be installed for this to work) and stores the parsed document in soup. This makes it much easier to fetch things. For instance, if you want to check the title of the website, all you have to do is:

results = soup.select('title')

The results are like the ones in the image! As you can see, this is a list of matching tags. Since we chose title, we got only one result. However, there could be several, and if you need just the text of a result, the command is:

results[0].getText()
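
To see what happens when a selector matches more than one element, here is a small sketch that runs without any network access. The HTML snippet and the h2.post selector are made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' so that nothing extra needs to be installed.

```python
import bs4

# A tiny, made-up page standing in for res.text.
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h2 class="post">First post</h2>
    <h2 class="post">Second post</h2>
  </body>
</html>
"""

# 'html.parser' ships with Python; the post itself uses 'lxml'.
soup = bs4.BeautifulSoup(html, "html.parser")

# select() always returns a list, even when there is a single match.
title_text = soup.select("title")[0].getText()

# With several matches, loop over the list and pull the text from each tag.
post_titles = [tag.getText() for tag in soup.select("h2.post")]

print(title_text)   # Sample Page
print(post_titles)  # ['First post', 'Second post']
```
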


