Jobs: pausing and resuming crawls
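Sometimes, for big sites, it’s desirable to pause crawls and be able to resume them later. Scrapy supports this functionality out of the box by providing the following facilities:
a scheduler that persists scheduled requests on disk
a duplicates filter that persists visited requests on disk
an extension that keeps some spider state (key/value pairs) persistent between batches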
To enable persistence support you just need to define a job directory through the JOBDIR setting. This directory stores all the data required to keep the state of a single job (i.e. a spider run). It must not be shared by different spiders, or even by different jobs/runs of the same spider, because it holds the state of exactly one job.
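One way to keep job directories separate is to derive the path from the spider name and a per-job label. The sketch below is only an illustration: job_dir_for, the crawls base directory and the label format are assumptions, not part of Scrapy.

```python
from pathlib import Path


def job_dir_for(spider_name: str, job_label: str, base: str = "crawls") -> Path:
    """Return a directory path used by exactly one job of one spider.

    Hypothetical helper: Scrapy only needs the resulting path as the
    value of the JOBDIR setting.
    """
    if not spider_name or not job_label:
        raise ValueError("spider_name and job_label must be non-empty")
    # One sub-directory per (spider, job) pair, so no state is shared.
    return Path(base) / f"{spider_name}-{job_label}"


first_run = job_dir_for("somespider", "1")
second_run = job_dir_for("somespider", "2")
other_spider = job_dir_for("otherspider", "1")
```

Reusing the same label points at the same directory, which is what resuming a paused job needs; a new label gives a fresh crawl its own directory.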
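To start a spider with persistence support enabled, pass the JOBDIR setting on the command line with the -s option, for example (somespider and crawls/somespider-1 are placeholder names):
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C once or sending a signal, and letting it shut down gracefully), and resume it later by issuing the same command with the same JOBDIR value.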
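You can keep persistent state between pause/resume batches in the spider’s state attribute (spider.state), which should be a dict. A built-in extension serializes it to the job directory when the spider stops and loads it again when the spider starts. For example, a callback can increment a counter stored under a key such as items_count each time it parses an item.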
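A minimal sketch of a callback that uses the spider state, with other spider code omitted for brevity. The class below is a stand-in so that the snippet runs without Scrapy installed; in a real project parse_item would be a method of a scrapy.Spider subclass, and Scrapy would populate the state dict when a job directory is configured. The class name and the items_count key are illustrative.

```python
class ItemCounterSketch:
    """Stand-in for a spider; only the state handling is shown."""

    def __init__(self):
        # In Scrapy this dict is loaded from the job directory on resume.
        self.state = {}

    def parse_item(self, response):
        # Parse the item here (omitted).
        # Increment a counter that survives pausing and resuming.
        self.state["items_count"] = self.state.get("items_count", 0) + 1


spider = ItemCounterSketch()
spider.parse_item(response=None)  # the response is unused in this sketch
spider.parse_item(response=None)
```

Using get with a default of 0 means the callback works both on the first run, when the dict is empty, and after a resume, when the previous count has been restored.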
There are a few things to keep in mind if you want to be able to use the Scrapy persistence support:
Cookies expiration
Cookies may expire. So, if you don’t resume your spider quickly, the scheduled requests may no longer work. This won’t be an issue if your spider doesn’t rely on cookies.
Request serialization
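For persistence to work, Request objects must be serializable with pickle, except for the callback and errback values passed to their __init__ method, which must be methods of the running Spider class.
If you wish to log the requests that couldn’t be serialized, you can set the SCHEDULER_DEBUG setting to True in the project’s settings. It is False by default.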
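Because scheduled requests are pickled to disk, the values attached to a request also have to be picklable. A quick way to check a value before attaching it to a request is sketched below; is_picklable is a hypothetical helper, not a Scrapy function, and it relies only on the standard pickle module.

```python
import pickle


def is_picklable(value) -> bool:
    """Return True if value can be serialized with pickle."""
    try:
        pickle.dumps(value)
    except Exception:
        # pickle raises different exception types depending on the object.
        return False
    return True


# Plain data such as strings and numbers pickles without problems.
plain_meta = {"page": 3, "category": "books"}

# A lambda cannot be pickled, so this dict would not survive a pause.
meta_with_lambda = {"on_done": lambda item: item}
```

Here is_picklable(plain_meta) succeeds and is_picklable(meta_with_lambda) fails, which suggests keeping request data to plain values and passing callbacks as spider methods.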