Debugging Spiders
Basically, this is a simple spider which parses two pages of items (the start_urls). Items also have a details page with additional information, so we use Request to pass a partially populated item on to the details callback.
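A minimal sketch of such a spider might look like this (the CSS selectors and URLs are placeholders, and cb_kwargs is one way to hand the partial item to the next callback):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        # follow each item link found on the listing pages
        for item_url in response.css('a.item::attr(href)').getall():
            yield response.follow(item_url, self.parse_item)

    def parse_item(self, response):
        # partially populate the item, then follow its details page,
        # passing the item along to the next callback
        item = {'url': response.url}
        details_url = response.css('a.details::attr(href)').get()
        yield response.follow(details_url, self.parse_details, cb_kwargs={'item': item})

    def parse_details(self, response, item=None):
        # populate more `item` fields from the details page
        return item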
The most basic way of checking the output of your spider is to use the parse command. It allows you to check the behaviour of different parts of the spider at the method level. It has the advantage of being flexible and simple to use, but does not allow debugging code inside a method.
In order to see the item scraped from a specific url:
$ scrapy parse --spider=myspider -c parse_item -d 2 <item_url>
>>> STATUS DEPTH LEVEL 2 <<<
# Scraped Items ------------------------------------------------------------
[]
Checking items scraped from a single start_url can also be easily achieved using:
$ scrapy parse --spider=myspider -d 3 'http://example.com/page1'
While the parse command is very useful for checking the behaviour of a spider, it is of little help for checking what happens inside a callback, beyond showing the response received and the output. How do you debug the situation when a callback sometimes receives no item?
Fortunately, the shell is your bread and butter in this case:
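For example, here is a sketch of how you might drop into the shell only when the details callback receives no item (the callback signature mirrors the spider sketch above):

from scrapy.shell import inspect_response

def parse_details(self, response, item=None):
    if item:
        # populate more `item` fields
        return item
    else:
        # open an interactive shell with this response so you can inspect it
        inspect_response(response, self)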
Sometimes you just want to see how a certain response looks in a browser; you can use the open_in_browser function for that. Here is an example of how you would use it:
from scrapy.utils.response import open_in_browser

def parse_details(self, response):
    open_in_browser(response)
open_in_browser will open a browser with the response received by Scrapy at that point, adjusting the base tag so that images and styles are displayed properly.
Logging is another useful option for getting information about your spider run. Although not as convenient, it comes with the advantage that the logs will be available in all future runs should they be necessary again:
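A sketch of how this could look in the details callback (the warning message is illustrative):

def parse_details(self, response, item=None):
    if item:
        # populate more `item` fields
        return item
    else:
        # record the problem in the log so it is visible in this run and in future runs
        self.logger.warning('No item received for %s', response.url)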