HTML Scraping

    Web sites are written using HTML, which means that each web page is a structured document. Sometimes it would be great to obtain some data from them and preserve the structure while we’re at it. Web sites don’t always provide their data in comfortable formats such as CSV or JSON.

    This is where web scraping comes in. Web scraping is the practice of using a computer program to sift through a web page and gather the data that you need in a format most useful to you while at the same time preserving the structure of the data.

    lxml and Requests

    lxml is a pretty extensive library written for parsing XML and HTML documents very quickly, even handling messed up tags in the process. We will also be using the Requests module instead of the already built-in urllib2 module (urllib.request in Python 3) due to improvements in speed and readability. You can easily install both using pip install lxml and pip install requests.

    Let’s start with the imports:

        from lxml import html
        import requests

    Next we will use requests.get to retrieve the web page with our data, parse it using the html module, and save the result in tree:

        page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
        tree = html.fromstring(page.content)

    (We need to use page.content rather than page.text because html.fromstring implicitly expects bytes as input.)
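    To illustrate the bytes point, here is a minimal, self-contained sketch; the HTML snippet and the id attribute are made up for the example:

    ```python
    from lxml import html

    # A made-up page body as raw bytes, the same form that page.content returns.
    raw = b'<html><body><p id="greeting">hello</p></body></html>'

    # fromstring parses the bytes and handles decoding itself.
    tree = html.fromstring(raw)
    print(tree.findtext('.//p'))  # hello
    ```

    Passing page.text (a decoded str) can fail with a ValueError when the document carries its own encoding declaration, which is why the bytes form is the safer default.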

    tree now contains the whole HTML file in a nice tree structure which we can go over two different ways: XPath and CSSSelect. In this example, we will focus on the former.

    XPath is a way of locating information in structured documents such as HTML or XML documents. Good introductions to XPath are available online.
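    As a quick taste of the syntax, here is a self-contained sketch using a made-up XML snippet and the standard library’s ElementTree, which supports a limited subset of XPath:

    ```python
    import xml.etree.ElementTree as ET

    # A small made-up document to query.
    doc = ET.fromstring(
        '<catalog>'
        '<book genre="fiction"><title>Dune</title></book>'
        '<book genre="reference"><title>Atlas</title></book>'
        '</catalog>'
    )

    # './/title' selects every <title> element anywhere below the root.
    titles = [t.text for t in doc.findall('.//title')]

    # A predicate in square brackets filters by attribute value.
    fiction = [b.findtext('title') for b in doc.findall(".//book[@genre='fiction']")]

    print(titles)   # ['Dune', 'Atlas']
    print(fiction)  # ['Dune']
    ```

    lxml’s xpath method understands the same path-and-predicate idea, plus much more of the full XPath language (such as the /text() step used below).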

    There are also various tools for obtaining the XPath of elements such as FireBug for Firefox or the Chrome Inspector. If you’re using Chrome, you can right click an element, choose ‘Inspect element’, highlight the code, right click again, and choose ‘Copy XPath’.

    Knowing this, we can create the correct XPath query and use the lxml xpath function like this:

        # This will create a list of buyers:
        buyers = tree.xpath('//div[@title="buyer-name"]/text()')
        # This will create a list of prices:
        prices = tree.xpath('//span[@class="item-price"]/text()')

    Let’s see what we got exactly:

        print('Buyers: ', buyers)
        print('Prices: ', prices)

        Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
        'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
        'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
        'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']

        Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
        '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
        '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
        '$15.00.']

    Congratulations! We have successfully scraped all the data we wanted from a web page using lxml and Requests. We have it stored in memory as two lists. Now we can do all sorts of cool stuff with it: we can analyze it using Python or we can save it to a file and share it with the world.
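    For example, here is a small sketch of the “analyze it” idea, using a made-up slice of the two scraped lists: strip the dollar signs, convert to floats, and pair each buyer with what they paid.

    ```python
    # A made-up slice of the scraped buyers and prices lists.
    buyers = ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes']
    prices = ['$29.95', '$8.37', '$15.26']

    # Strip the '$' and convert each price to a float so we can do arithmetic.
    amounts = [float(p.lstrip('$')) for p in prices]

    # Pair each buyer with the amount they paid.
    spending = dict(zip(buyers, amounts))

    print(spending['Patty Cakes'])          # 15.26
    print(round(sum(amounts), 2))           # 53.58
    print(max(spending, key=spending.get))  # Carson Busses
    ```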

    Some more cool ideas to think about are modifying this script to iterate through the rest of the pages of this example dataset, or rewriting this application to use threads for improved speed.
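    One way the threaded rewrite could look, sketched with concurrent.futures; note that the multi-page URL scheme (002.html, 003.html, and so on) is an assumption about how the example dataset is numbered:

    ```python
    from concurrent.futures import ThreadPoolExecutor

    import requests
    from lxml import html

    def page_url(n):
        # Assumed numbering scheme for the example dataset: 001.html, 002.html, ...
        return 'http://econpy.pythonanywhere.com/ex/%03d.html' % n

    def scrape_prices(url):
        # Fetch one page and pull out its item prices, exactly as before.
        page = requests.get(url)
        tree = html.fromstring(page.content)
        return tree.xpath('//span[@class="item-price"]/text()')

    if __name__ == '__main__':
        urls = [page_url(n) for n in range(1, 6)]
        # Threads help here because the work is network-bound, not CPU-bound:
        # while one request waits on the server, others can proceed.
        with ThreadPoolExecutor(max_workers=4) as pool:
            for url, prices in zip(urls, pool.map(scrape_prices, urls)):
                print(url, prices)
    ```

    pool.map keeps the results in the same order as the input URLs, so the output stays easy to match back to its source page.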