Headless Web Scraping Tutorial Series Using Mechanize ( 2 ) { Starting From Understanding The Basics }

understanding the browser we use to explore web pages in mechanize

We have already installed mechanize and BeautifulSoup, so now we are going to use these modules for the first time.

First of all we will create a browser instance from mechanize. Open the Python IDLE shell and start typing …

>>> import mechanize
>>> import cookielib
>>> br = mechanize.Browser()

Now we have created a headless browser named ” br ”, and we are going to add some properties to it so it can simulate a real browser instead of looking like a robot.

>>> cj = cookielib.LWPCookieJar()   # a cookie jar that stores cookies in LWP format
>>> br.set_cookiejar(cj)            # attach the cookie jar to our browser
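If you also want the cookies to survive between runs of your script, the LWPCookieJar can write them to a file and load them back; a minimal sketch, where the filename cookies.txt is just an example:

>>> cj.save('cookies.txt', ignore_discard=True, ignore_expires=True)   # write the cookies to disk
>>> cj.load('cookies.txt', ignore_discard=True, ignore_expires=True)   # load them back in a later session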

We have added cookie support to our browser; the cookie jar will store the cookies it receives during a session. Now we are going to turn on some more basic handlers.

>>> br.set_handle_equiv(True)      # treat HTML http-equiv meta tags like HTTP headers
>>> br.set_handle_gzip(True)       # transparently handle gzip-compressed responses (experimental)
>>> br.set_handle_redirect(True)   # follow HTTP 30x redirects
>>> br.set_handle_referer(True)    # send the Referer header automatically
>>> br.set_handle_robots(False)    # do not obey robots.txt

Further, a User-Agent string is also required so that the websites we access think the requests are coming from the real browser named in that string.

>>> br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

Some more features, like debug output and following meta-refreshes, can also be enabled:

>>> br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)   # follow HTML meta-refresh, with a maximum delay of 1 second
>>> br.set_debug_http(True)        # print HTTP headers
>>> br.set_debug_redirects(True)   # log information about redirects and refreshes
>>> br.set_debug_responses(True)   # log response bodies (i.e. the HTML)
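Note that the redirect and response debugging goes through Python's logging module, so to actually see that output on the screen you may also need to attach a handler to the "mechanize" logger; a minimal sketch:

>>> import sys, logging
>>> logger = logging.getLogger("mechanize")
>>> logger.addHandler(logging.StreamHandler(sys.stdout))   # send the log records to stdout
>>> logger.setLevel(logging.INFO)                          # INFO level is enough to see the debug output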

Now we have a browser instance to interact with web pages and inspect them. Let's see how to use it: we will use the open function to open a URL and read the HTML from that page.

>>> response = br.open('http://google.com')
>>> html = response.read()                 
>>> print html

And yes, we can see the pretty source code of the Google homepage printed out.
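Besides the raw HTML, the response and the browser let us peek at a few useful details as well; a small sketch using standard mechanize calls:

>>> response.geturl()      # the final URL, after any redirects
>>> print response.info()  # the HTTP response headers
>>> print br.title()       # the <title> of the page we just opened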

In our next part we are going to learn how to extract links from a webpage. Stay in touch and enjoy the headless web scraping. 🤘🤘

Headless Web Scraping Tutorial Series Using Mechanize ( 1 ) { Intro to Mechanize-Python } 

intro to the headless browser mechanize ( a lightweight but powerful module )

Welcome to the headless web scraping series. 

This series requires basic knowledge of the Python programming language, web scraping and headless browsing. If you don’t know what these terms mean, search through the blog and give those posts a read.

Mechanize  

Mechanize is a headless browser, and it is the browser we will use in this tutorial. It is a lightweight module and very easy to use once you understand its basics. Keep in mind that it is powerful enough to handle almost every command we give a browser ( alas, it doesn’t support Javascript, but you can check what the Javascript is doing, write an equivalent piece of code and execute it ), so it is at its best on websites with simple Javascript. So let’s dive into the sea of headless web scraping.

How To Install

There are two methods for this. First, open cmd, go to the “Scripts” directory inside the directory where Python is installed, and execute the command “ pip install mechanize “. At the end you should see a message like ” Successfully installed mechanize ”.

The second method is to download the compressed file from the Python package repository and extract the mechanize folder inside it into the site-packages directory.

Finally, open a Python shell and write “ import mechanize “. If you don’t see any error, cheers, you have successfully installed mechanize.
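If you want a little more confirmation, you can also print the version from the same shell; a small sketch, assuming your mechanize release exposes the usual __version__ attribute:

>>> import mechanize
>>> print mechanize.__version__   # prints the installed mechanize version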

Check out the further tutorials and keep scraping 🤘🤘