Sampling Data for MongoDB

You can use the options in the Sample View to specify how the connector samples data from MongoDB to generate a schema definition. The connector samples the data in order to detect its structure and determine the data mappings that best support the data.

To sample data for MongoDB:

  1. Choose one:

    • From the start page, click Create New, provide your connection information, and then click Connect. For detailed information about how to specify connection information, see Connecting to a Data Store.
      The Sample dialog box opens.

    • Or, from the Design View, click the Sample View tab. If you are not already connected to a data store, the Schema Editor prompts you to provide your connection information. For detailed information about how to specify connection information, see Connecting to a Data Store.
      The Schema Editor displays the Sample View.

  2. From the Sampling Method drop-down list, select the sampling method to use. You can sample records sequentially by selecting the Forward option (starting from the first record) or the Backward option (starting from the last record), or at random by selecting the Random option.

    Note:

    The random sampling method is only supported by MongoDB Server 3.2 or higher.

  3. In the Sampling Count field, type the maximum number of records that the connector can sample to generate the schema definition. To sample every record in the database, set this option to 0.

    Note:

    Typically, sampling a large number of records results in a schema definition that is more accurate and better able to represent all the data in the database. However, the sampling process might take longer than expected when many records are sampled, especially if the database contains complex, nested data structures.

  4. In the Sampling Interval field, type the interval at which the connector samples a record when scanning through the data store. For example, if you set this option to 2, then the connector samples every second record in the data store. This option is ignored if the Sampling Method is set to Random.
  5. In the bottom pane, specify the collections that the connector samples records from by selecting the corresponding check boxes in the Selected column. You can select every collection in the database by selecting the check box in the Selected column header.

    Note:

    You can group and sort collections by clicking a column header. For example, to group collections based on the catalogs they belong to and then sort those groupings by catalog name in ascending order, click the Catalog column header. To sort the list in descending order, click the header again. To disable sorting, click the header a third time.

  6. To filter the data from a collection so that only certain documents are included in the sampling process, type a JSON filter in the corresponding field in the JSON Filter column. The connector samples data only from documents in the collection that meet the filter conditions.

    Note:

    • The JSON filter is used only during the sampling process. It does not affect the data that is returned when you preview data in the Design View.
    • For information about the syntax of the JSON Filter value, see "db.collection.find()" in the MongoDB Manual: http://docs.mongodb.org/manual/reference/method/db.collection.find/#db.collection.find. The JSON filter is the argument for the "query" parameter.
    • The value that you type in the JSON Filter field is comparable to a WHERE clause in SQL. For example, to select a row that has the _id value T123 from a table named Customers, you would use the SQL statement SELECT * FROM Customers WHERE _id = 'T123'. In the Schema Editor, to sample data only from documents that have the _id value T123 in a collection named Customers, you would select the collection named Customers and then type {"_id": "T123"} in the JSON Filter field.

  7. To generate the schema, click Sample.

The connector samples the data as specified and generates a schema definition, which opens in the Design View in the Schema Editor. If you return to the Sample View, you will see that the check boxes in the Sampled column are selected for all the collections that were included in the sampling process.
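
To make the interaction between the Sampling Method, Sampling Count, Sampling Interval, and JSON Filter options easier to picture, the following Python sketch imitates the selection on an in-memory list of documents. It is an illustration only, not the connector's implementation. The names pick_samples and matches are hypothetical, the filter handles only top-level equality, and two behaviors are assumptions that the steps above do not state: that the filter is applied before records are counted, and that interval-based scanning begins with the first record in scan order.

```python
import random


def matches(document, json_filter):
    """Return True if every top-level key in the filter equals the document's value.

    Hypothetical helper: real MongoDB query filters support operators and
    nested conditions that are not modeled here.
    """
    return all(document.get(key) == value for key, value in json_filter.items())


def pick_samples(documents, method="Forward", count=0, interval=1,
                 json_filter=None, seed=None):
    """Imitate the Sample View options on a list of dict documents."""
    if interval < 1:
        raise ValueError("interval must be 1 or greater")

    # Assumption: the JSON filter narrows the collection before sampling.
    candidates = [d for d in documents if matches(d, json_filter or {})]

    # A Sampling Count of 0 means "sample every record".
    limit = len(candidates) if count == 0 else count

    if method == "Random":
        # The Sampling Interval is ignored for random sampling.
        rng = random.Random(seed)
        return rng.sample(candidates, min(limit, len(candidates)))

    if method == "Backward":
        candidates = candidates[::-1]  # start from the last record
    elif method != "Forward":
        raise ValueError(f"unknown sampling method: {method}")

    # Assumption: take the first record in scan order, then every
    # interval-th record after it, stopping at the count limit.
    return candidates[::interval][:limit]


# Ten made-up documents with _id values T1 through T10.
customers = [{"_id": f"T{n}"} for n in range(1, 11)]

forward = pick_samples(customers, "Forward", count=3, interval=2)
backward = pick_samples(customers, "Backward", count=2)
filtered = pick_samples(customers, json_filter={"_id": "T3"})
```

Under these assumptions, forward holds the documents T1, T3, and T5, backward holds T10 and T9, and filtered holds only T3. A larger count or a smaller interval examines more documents, which mirrors the trade-off between schema accuracy and sampling time described in the note for the Sampling Count field.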