I just released a MongoDB pipeline for Scrapy called scrapy-mongodb. The module supports both regular MongoDB deployments and replica sets. When you log your items with scrapy-mongodb, you will see the collected items in MongoDB right away. This post shows you how to use it in your Scrapy project.
See the scrapy-mongodb GitHub page for source code and additional documentation.
Installing scrapy-mongodb
The installation is straightforward. Simply install scrapy-mongodb using pip:
pip install scrapy-mongodb
Note that you might need to run pip as administrator.
Option 1: Configuring scrapy-mongodb for single MongoDB instances
scrapy-mongodb needs some details about the MongoDB database that you want to store your items in. Update your Scrapy settings.py with the following:
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DATABASE = 'myDatabaseName'
MONGODB_COLLECTION = 'myCollectionName'
If you want scrapy-mongodb to create and use a unique key for your items, add the following setting as well:
MONGODB_UNIQUE_KEY = 'keyName'
scrapy-mongodb will automatically ensure an index on that key.
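To make the effect of a unique key more concrete, here is a minimal sketch of the general technique using pymongo-style calls: a unique index plus an upsert keyed on that field. This is an assumption about how such a key is typically used, not a copy of the scrapy-mongodb source. The helper names ensure_unique_index and store_item are hypothetical.

```python
def ensure_unique_index(collection, unique_key):
    """Create a unique index on unique_key (harmless if it already exists)."""
    collection.create_index(unique_key, unique=True)


def store_item(collection, item, unique_key):
    """Insert the item, or update the stored copy with the same key value."""
    collection.update_one(
        {unique_key: item[unique_key]},  # match on the unique key only
        {"$set": dict(item)},            # overwrite the stored fields
        upsert=True,                     # insert when nothing matches
    )
```

With this pattern, scraping the same item twice updates the existing document instead of storing a duplicate. Here collection is expected to be a pymongo Collection object.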
Then we need to tell Scrapy to use the new pipeline. Add the following to your settings.py file:
ITEM_PIPELINES = [
'scrapy_mongodb.MongoDBPipeline',
]
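The list form above was accepted by older Scrapy versions. Newer Scrapy releases expect ITEM_PIPELINES to be a dict that maps each pipeline path to an integer order value, so on a recent Scrapy the equivalent setting would look roughly like this. The value 300 is an arbitrary choice, used here only for illustration.

```python
# settings.py (sketch for newer Scrapy versions)
ITEM_PIPELINES = {
    # Pipelines with lower numbers run first; 300 is an arbitrary value.
    'scrapy_mongodb.MongoDBPipeline': 300,
}
```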
Additional configuration options can be found at https://github.com/sebdah/scrapy-mongodb.
Option 2: Configuring scrapy-mongodb for MongoDB replica sets
If you are logging the items to a MongoDB replica set, you will need to configure scrapy-mongodb to be replica set aware. Update your Scrapy settings.py with the following:
MONGODB_REPLICA_SET = 'replicaSetName'
MONGODB_REPLICA_HOSTS = 'h1.example.com,h2.example.com,h3.example.com'
MONGODB_DATABASE = 'myDatabaseName'
MONGODB_COLLECTION = 'myCollectionName'
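For orientation, the replica set name and host list above carry the same information as a standard MongoDB connection string. The sketch below builds such a URI from the two values. The helper name build_replica_set_uri is hypothetical, and this is not necessarily how scrapy-mongodb connects internally.

```python
def build_replica_set_uri(hosts, replica_set):
    """Build a mongodb:// URI from a comma-separated host string."""
    # Split on commas and drop stray whitespace or empty entries.
    host_list = [h.strip() for h in hosts.split(",") if h.strip()]
    if not host_list:
        raise ValueError("at least one host is required")
    return "mongodb://{}/?replicaSet={}".format(",".join(host_list), replica_set)


uri = build_replica_set_uri(
    "h1.example.com,h2.example.com,h3.example.com", "replicaSetName"
)
print(uri)
# mongodb://h1.example.com,h2.example.com,h3.example.com/?replicaSet=replicaSetName
```

A URI of this shape can be passed to pymongo.MongoClient if you want to test connectivity to the replica set outside of Scrapy.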
If you want scrapy-mongodb to create and use a unique key for your items, add the following setting as well:
MONGODB_UNIQUE_KEY = 'keyName'
scrapy-mongodb will automatically ensure an index on that key.
Then we need to tell Scrapy to use the new pipeline. Add the following to your settings.py file:
ITEM_PIPELINES = [
'scrapy_mongodb.MongoDBPipeline',
]
Additional configuration options can be found at https://github.com/sebdah/scrapy-mongodb.
Summary
Done! Now start your spider as usual and have a look in MongoDB for your items. They will show up as soon as the spider has found and processed them, so you can follow the progress as the spider crawls :).
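If you would rather check from Python than from the mongo shell, a small pymongo sketch like the following can report progress. It assumes pymongo is installed and reuses the example host, port, database, and collection names from the single-instance settings earlier in this post. The function names summarize and main are hypothetical.

```python
def summarize(collection):
    """Return (item count, one sample document or None) for a collection."""
    return collection.count_documents({}), collection.find_one()


def main():
    # Imported here so summarize() can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    collection = client["myDatabaseName"]["myCollectionName"]
    count, sample = summarize(collection)
    print("items stored so far:", count)
    print("sample item:", sample)
```

Call main() while the spider is running, and the count should grow as items are processed.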