
DKAN

The DKAN harvester is a CKAN harvester that can be used to harvest metadata from DKAN open data portals.

DKAN is a community-driven, free and open source open data platform built on Drupal. It lets organisations and individuals publish and consume structured information.

Enable the Harvester

To enable the harvester, add dkan_harvester to the ckan.plugins setting in your CKAN configuration file (e.g., ckan.ini or production.ini).

ckan.plugins = ... dkan_harvester ...

Configuration options

delay [optional]

The delay parameter is used to control the time between requests to the remote API.

Remote APIs often impose rate limits. Spacing requests out with the delay parameter helps the harvester stay within those limits.

Type: int

Default: 0
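The idea behind delay can be sketched as follows. This is a minimal illustration, not the harvester's actual implementation: the fetch_with_delay helper and its fetch and sleep arguments are hypothetical, and the sketch assumes the delay is measured in seconds, which the description above does not state.

```python
import time
from typing import Callable, Iterable, Iterator


def fetch_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], dict],
    delay: int = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield fetch(url) for each URL, pausing between consecutive requests.

    Hypothetical helper: `fetch` stands in for whatever performs the remote
    request, and `delay` is assumed to be a number of seconds.
    """
    for index, url in enumerate(urls):
        # Pause before every request except the first; delay=0 disables pausing.
        if index > 0 and delay > 0:
            sleep(delay)
        yield fetch(url)
```

Passing sleep as an argument keeps the sketch easy to test without actually waiting.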

max_datasets [optional]

The max_datasets parameter limits the number of datasets harvested from this harvest source.

This feature is useful for testing or development purposes, allowing you to perform a quick test with a smaller subset of data and verify that the harvested data meets your requirements.

If set to 0, all available datasets will be harvested.

Type: int

Default: 0
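The documented behaviour of max_datasets, where 0 means no limit, can be sketched like this. The limit_datasets function is a hypothetical illustration and is not taken from the harvester's code.

```python
from itertools import islice
from typing import Iterable, Iterator


def limit_datasets(datasets: Iterable[dict], max_datasets: int = 0) -> Iterator[dict]:
    """Return at most `max_datasets` items; 0 means "harvest everything".

    Hypothetical helper for illustration only.
    """
    if max_datasets == 0:
        # No limit configured: pass every dataset through unchanged.
        return iter(datasets)
    # Stop after the first `max_datasets` items without reading the rest.
    return islice(datasets, max_datasets)
```

Because islice is lazy, a small limit would stop early instead of walking through the whole remote catalogue.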

tsm_schema [optional]

The tsm_schema parameter lets you define a transmute schema that is used to transform the harvested data before a dataset is created or updated in CKAN.

This is useful when the harvested data doesn't match the CKAN dataset schema and you need to transform it.

Otherwise, you'd need to write a custom harvester and process the remote data yourself.

See the ckanext-transmute documentation to learn more about the transmute schema syntax.

Example
{
    "root": "Dataset",
    "types": {
        "Dataset": {
            "fields": {
                "title": {
                    "validators": [
                        "tsm_string_only",
                        "tsm_to_lowercase",
                        "tsm_name_validator"
                    ],
                    "map": "name"
                },
                "resources": {
                    "type": "Resource",
                    "multiple": true,
                    "map": "attachments"
                },
                "metadata_created": {
                    "validators": [
                        "tsm_isodate"
                    ],
                    "default": "2022-02-03T15:54:26.359453"
                },
                "metadata_modified": {
                    "validators": [
                        "tsm_isodate"
                    ],
                    "default_from": "metadata_created"
                },
                "metadata_reviewed": {
                    "validators": [
                        "tsm_isodate"
                    ],
                    "replace_from": "metadata_modified"
                }
            }
        },
        "Resource": {
            "fields": {
                "title": {
                    "validators": [
                        "tsm_string_only"
                    ],
                    "map": "name"
                },
                "extension": {
                    "validators": [
                        "tsm_string_only",
                        "tsm_to_uppercase"
                    ],
                    "map": "format"
                },
                "web": {
                    "validators": [
                        "tsm_string_only"
                    ],
                    "map": "url"
                },
                "sub-resources": {
                    "type": "Sub-Resource",
                    "multiple": true
                }
            }
        }
    }
}

Type: dict[str, Any]

Default: None
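As a rough sketch, the three options can be combined into one configuration object. This assumes the options are supplied as a JSON object in the harvest source configuration, which this page does not state explicitly. The values are illustrative, and the schema is a shortened version of the example above.

```python
import json

# Illustrative values only; the assumption is that the harvest source
# configuration is a JSON object holding the options described above.
source_config = {
    "delay": 1,           # time between requests to the remote API
    "max_datasets": 10,   # harvest a small subset for a quick test
    "tsm_schema": {
        "root": "Dataset",
        "types": {
            "Dataset": {
                "fields": {
                    "title": {
                        "validators": ["tsm_string_only"],
                        "map": "name",
                    }
                }
            }
        },
    },
}

# Serialise to strict JSON (no trailing commas, lowercase true/false).
config_json = json.dumps(source_config, indent=4)
print(config_json)
```

Generating the JSON from a Python dict avoids syntax slips such as trailing commas, which strict JSON parsers reject.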