# FAQ
This document encompasses many of the frequently asked questions (FAQs) about Mongo Connector.
- How do I re-sync all data from scratch?
- What versions of MongoDB are supported by Mongo Connector?
- My oplog progress file always seems really out of date. What's going on?
- Why are some fields in my MongoDB documents not appearing in Solr?
- What is the `mongodb_meta` index in Elasticsearch?
- Why are my documents empty in Elasticsearch? Why are updates not happening in Elasticsearch?
- How many threads does Mongo Connector start?
- How do I increase the speed of Mongo Connector?
- Does Mongo Connector support dynamic schemas for Solr?
- How can I load several Solr cores with Mongo Connector?
- I can't install Mongo Connector! I'm getting the error "README.rst: No such file or directory"
- Can I run more than one instance of `mongo-connector` at the same time?
- How can I install Mongo Connector without internet access?
- Why is the last entry already processed, Up to date, while using namespace command line args, even though collections are not synced to destination?
- Why can't I use Mongo Connector with only the mongos?
- [Using Mongo Connector with Docker](#using-mongo-connector-with-docker)
- InvalidBSON: date value out of range
### How do I re-sync all data from scratch?

- Stop `mongo-connector`.
- Delete the oplog progress file.
- Restart `mongo-connector`.
### What versions of MongoDB are supported by Mongo Connector?

`mongo-connector` is compatible with MongoDB >= 2.4.x. Mongo Connector may work with versions of MongoDB prior to 2.4.x, but this has not been tested.
### My oplog progress file always seems really out of date. What's going on?

Mongo Connector updates the oplog progress file (called `oplog.timestamp` by default) whenever its cursor into the MongoDB oplog is closed. Note that this may come long after Mongo Connector has read and processed all entries currently in the oplog. This is due to the connector's use of a tailable cursor, which can be re-used to retrieve documents that arrive in the oplog even after the cursor is created. Thus, you cannot rely on the progress file being updated automatically after the oplog is exhausted.

Instead, Mongo Connector provides the `--batch-size` option, with which you can specify the maximum number of documents Mongo Connector may process before having to record its progress. For example, if you wanted to make sure that Mongo Connector records its progress at least every 100 operations in the oplog, you could run:

```
mongo-connector -m <source host/port> -t <destination host/port> --batch-size=100
```
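If you use a JSON configuration file rather than command-line flags, the equivalent setting is `batchSize`. A minimal sketch, assuming the standard configuration-file keys and illustrative addresses:

```json
{
    "mainAddress": "localhost:27017",
    "batchSize": 100,
    "docManagers": [
        {
            "docManager": "solr_doc_manager",
            "targetURL": "http://localhost:8983/solr"
        }
    ]
}
```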
### Why are some fields in my MongoDB documents not appearing in Solr?

Documents containing fields that are not defined in the Solr collection's schema cannot be inserted, and Solr will log an exception. Mongo Connector therefore reads your Solr collection's schema before replicating any operations to Solr, in order to avoid sending invalid requests. Documents replicated to Solr from MongoDB may be altered to remove fields that aren't in the schema, and the result may look as if your documents are missing certain fields.

The solution is to update your `schema.xml` file to declare the missing fields and reload the relevant Solr cores.
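For example, if your MongoDB documents carry an `author` field that is being dropped, you might declare it in `schema.xml` (the field name and type here are illustrative):

```xml
<!-- Declare the field so Solr will accept and store it -->
<field name="author" type="string" indexed="true" stored="true"/>
```

Then reload the core through the CoreAdmin API (the core name is a placeholder):

```sh
curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycore'
```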
### What is the `mongodb_meta` index in Elasticsearch?

Mongo Connector creates a `mongodb_meta` index in Elasticsearch in order to keep track of when documents were last modified. This is used to resolve conflicts in the event of a replica set rollback, but it is kept in a separate index so that it can be removed easily if necessary.
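If you do need to remove it, a standard Elasticsearch delete-index request is enough (assuming a default local Elasticsearch on port 9200):

```sh
curl -XDELETE 'http://localhost:9200/mongodb_meta'
```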
### Why are my documents empty in Elasticsearch? Why are updates not happening in Elasticsearch?

Mongo Connector needs `_source` to be enabled on your Elasticsearch indices in order to apply update operations. Make sure that you have this enabled.
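In Elasticsearch, `_source` is enabled by default and is only off if a mapping explicitly disables it. You can inspect a mapping to check (the index name here is a placeholder):

```sh
# Look for "_source": {"enabled": false} in the output; it should be absent
curl 'http://localhost:9200/myindex/_mapping?pretty'
```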
### How many threads does Mongo Connector start?

Mongo Connector starts one thread for each oplog (i.e., for each replica set), plus an additional thread to monitor them. Thus, if you have a three-shard cluster where each shard is a replica set, you will have:

- 1 Connector thread (starts the OplogThreads and monitors them)
- 3 OplogThreads (one for each shard)
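On Linux you can sanity-check this by counting the threads of a running connector process. A rough sketch, assuming a single `mongo-connector` process (the count may also include interpreter housekeeping threads):

```sh
# NLWP is the number of threads in the process
ps -o nlwp= -p "$(pgrep -of mongo-connector)"
```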
### How do I increase the speed of Mongo Connector?

- Increase the value for `--auto-commit-interval` (or, even better, don't specify it at all and let it be `None`). Setting this value higher means we don't need to refresh the remote system as often, which can save time. Leaving this option out entirely leaves when to refresh indexes up to the remote indexing system itself; most indexing systems have some way to configure this.
- If you only need to replicate certain collections, use the `--namespace-set` option to specify them. You can also run separate instances of Mongo Connector, each with a single namespace to replicate, so that those namespaces are replicated in parallel. Note that this may mean that some collections end up further ahead of or behind others, especially if the number of operations is unbalanced across these collections.
- Increase the value for `--batch-size`, or leave it out, so that Mongo Connector records its timestamp less frequently.
- Increase the value of `bulkSize` for your DocManagers, so that more documents are sent in each request to the remote end.
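Putting several of these together, a hedged sketch of a tuned invocation (the hosts and namespace are illustrative, and `--auto-commit-interval` is deliberately left unset):

```sh
mongo-connector -m localhost:27017 -t localhost:9200 -d elastic_doc_manager \
    --namespace-set db.important_collection --batch-size=1000
```

Note that `bulkSize` is set per doc manager in the JSON configuration file rather than on the command line.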
### Does Mongo Connector support dynamic schemas for Solr?

Mongo Connector does not currently support this. However, restarting Mongo Connector will cause it to re-read the schema definition.
### How can I load several Solr cores with Mongo Connector?

There are two options:

- Use multiple `solr_doc_manager`s. When you do this, all MongoDB collections go to all cores. This isn't a very common use case.
- Use multiple instances of `mongo-connector`, passing the base URL of each core to `docManagers.XXX.targetURL`. This allows you to refine which collections, and which fields from each document, get sent to each core (see the sketch below).
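A sketch of the second option, with two hypothetical cores, namespaces, and progress files (all names and URLs are illustrative; each instance needs its own oplog progress file):

```sh
# One connector per core, each replicating a single namespace
mongo-connector -m localhost:27017 -t http://localhost:8983/solr/core1 \
    -d solr_doc_manager -n db.articles -o oplog-core1.timestamp
mongo-connector -m localhost:27017 -t http://localhost:8983/solr/core2 \
    -d solr_doc_manager -n db.comments -o oplog-core2.timestamp
```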
### I can't install Mongo Connector! I'm getting the error "README.rst: No such file or directory"

Make sure you have a recent version of `setuptools` installed. Any version after 0.6.26 should do the trick:

```
pip install --upgrade setuptools
```
### Can I run more than one instance of `mongo-connector` at the same time?

The short answer is yes. However, care must be taken so that multiple connectors operate on mutually exclusive sets of namespaces. This is fine:

```
mongo-connector -n a.b -g A.B
mongo-connector -n c.d -g C.D
```

However, the following should be avoided:

```
mongo-connector -n a.b -g A.B
mongo-connector -n c.d -g A.B
```

as well as:

```
mongo-connector -n a.b -g A.B
mongo-connector -n a.b -g C.D
```
### How can I install Mongo Connector without internet access?

On a server that does have internet access:

```
python -m pip install --download /path/to/some/dir mongo-connector
```

Then, on the offline server (which is connected to the first server):

```
python -m pip install --ignore-installed --no-index --find-links /path/to/some/dir mongo-connector
```

N.B. pip is available in standard Python builds. However, if your offline server runs CentOS 6.x or another older Linux distribution with Python 2.6.x, you might have an outdated version of pip that doesn't use the wheel package format. Some users have reported difficulty installing packages offline without the wheel package format, so you should either upgrade pip, or you may need to run the first command several times.

On the server with HTTP access, use the same outdated pip (e.g., 1.3.1) to grab the dependencies, running

```
pip install --download /path/to/some/dir mongo-connector
```

several times (probably 2-3 times, so that it grabs all the needed dependencies; each run you'll see additional .gz files appear in the folder). After that you can run

```
tar zcvf mongo-connector.tar.gz /path/to/grabbed/dependencies
```

and transfer the archive to the offline server for

```
pip install --no-index --find-links /path/to/some/dir mongo-connector
```

This is not an issue with pip 7.x.x, as it uses wheel.
### Why is the last entry already processed, Up to date, while using namespace command line args, even though collections are not synced to destination?

Mongo Connector works by tailing the oplog in MongoDB. When namespaces are given, Mongo Connector specifically looks for oplog entries tagged with those namespaces. For example, if you have used mongorestore to restore a whole database with multiple collections in it, MongoDB writes only one entry, with the database name, to the oplog. So trying to use Mongo Connector on one specific collection wouldn't sync anything, because there are no oplog entries for that collection. Make sure some operations are performed on the namespace you are trying to use.
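For example, a single trivial write is enough to produce an oplog entry tagged with the collection's namespace. A minimal sketch using PyMongo (the host, database, and collection names are illustrative):

```python
from pymongo import MongoClient

# Any insert/update/delete on the namespace creates a tagged oplog entry
client = MongoClient("localhost", 27017)
client["mydb"]["mycollection"].insert_one({"touched": True})
```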
### Why can't I use Mongo Connector with only the mongos?

Mongo Connector must be able to read the oplogs, which live in the "local" database of each shard. The "local" database is not accessible through mongos; it can only be accessed by connecting directly to each shard. If you are getting the error `OperationFailure: not authorized on local to execute command { find: "oplog.rs", filter: {}, limit: 1, singleBatch: true }`, then you are not able to connect to your shards. You can find the shard addresses by connecting to the mongos and running `sh.status()`. Try connecting directly to those shard addresses; if you cannot connect, then Mongo Connector will not be able to run.

If you are running your cluster through Compose or another hosting tool, make sure that you are able to directly access your shards.
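For example, with the mongo shell (the hostnames and ports here are illustrative):

```sh
# List the shard addresses from the mongos
mongo --host localhost:27017 --eval 'sh.status()'

# Then try reading the oplog directly from one shard
mongo --host shard01.example.com:27018 --eval 'db.getSiblingDB("local").oplog.rs.findOne()'
```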
### Using Mongo Connector with Docker

We are collecting information from various users about their experiences using Mongo Connector with Docker. Please check here before filing a new ticket.

- `ServerSelectionTimeoutError: Could not reach any servers in [(u'344da2f17060', 27017)]. Replica set is configured with internal hostnames or IPs?` See issue 391.
- `ImportError: No module named 'mongo_connector.doc_managers.elastic_doc_manager'` See issue 436.
- `Last entry no longer in oplog cannot recover!` See issue 287.
- `ConnectionFailed: ConnectionError(('Connection aborted.', BadStatusLine("''",))) caused by: ProtocolError(('Connection aborted.', BadStatusLine("''",)))` See issue 251.
### InvalidBSON: date value out of range

This happens when decoding a document that contains a date value outside the range that can be represented by Python datetimes, for example a year greater than 9999. Since the issue is with Python datetimes, there isn't much that mongo-connector itself can do about it. To work around the issue, install the PyMongo C extensions. The Python C API allows the creation of datetimes that represent a wider range of dates.
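You can check whether the C extensions are in use; `pymongo.has_c()` and `bson.has_c()` are part of PyMongo's public API:

```python
import bson
import pymongo

# Both print True when the C extensions were compiled at install time
print(pymongo.has_c())
print(bson.has_c())
```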