Tags: tutorial, crawler, administrator, advanced-user

Setting up a crawler workflow#

Note

This page has been migrated from the old documentation and has not yet been fully revised. There might be inconsistencies or errors when using it with current LinkAhead versions.

The LinkAhead crawler aims to provide a very flexible framework for synchronizing data on file systems (or potentially other sources of information) with a running LinkAhead instance. The workflow used in a scientific environment should be chosen according to the users' needs. It is also possible to combine multiple workflows or to use them in parallel.

In this document we will describe several workflows for crawler operation.

Local Crawler Operation#

A very simple setup, which can also be used reliably for testing, runs the crawler on a local computer. The files that are being crawled need to be visible to both the locally running crawler and the LinkAhead server.

Prerequisites#

  • Make sure that LinkAhead is running, that your computer has a network connection to LinkAhead, and that your pycaosdb.ini points to the correct LinkAhead instance. Please refer to the pylib manual for questions related to the configuration in pycaosdb.ini.

  • Make sure that caosdb-crawler and caosdb-advanced-user-tools are installed (e.g. using pip).

  • Make sure that you have created:

    • The data model needed for the crawler.

    • A file “identifiables.yml” describing the identifiables.

    • A cfood file, e.g. cfood.yml.
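The checklist above can be partly automated. The following is a minimal sketch, not part of the LinkAhead tooling: the helper name missing_prerequisites is hypothetical, and it assumes that caosdb-crawler is importable as the module caoscrawler (caosadvancedtools is the module name used elsewhere on this page). It does not check the server connection or pycaosdb.ini.

```python
import importlib.util
from pathlib import Path


def missing_prerequisites(workdir,
                          modules=("caosadvancedtools", "caoscrawler"),
                          files=("identifiables.yml", "cfood.yml")):
    """Return a list of problems; an empty list means all checks passed.

    The module name "caoscrawler" is an assumption about how the
    caosdb-crawler package is imported. Adjust it if your installation differs.
    """
    problems = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python module not installed: {name}")
    for name in files:
        if not (Path(workdir) / name).is_file():
            problems.append(f"file not found: {name}")
    return problems
```

Calling missing_prerequisites(".") in the directory that holds identifiables.yml and cfood.yml returns an empty list when both packages and both files are found.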

Running the crawler#

Running the crawler currently involves two steps:

  • Inserting the files

  • Running the crawler program

Inserting the files#

This can be done using the module “loadFiles” from caosadvancedtools. See the advancedtools section for an installation guide.

The generic syntax is:

python3 -m caosadvancedtools.loadFiles -p <prefix-in-caosdb-file-system> <path-to-crawled-folder>

Note

The <path-to-crawled-folder> is the location of the files as seen by LinkAhead. This means that for a LinkAhead instance running in a docker container, the command might be:

python3 -m caosadvancedtools.loadFiles -p / /opt/caosdb/mnt/extroot/ExperimentalData

This command would load the folder “ExperimentalData” contained in the folder extroot within the docker container, and copy it to the LinkAhead root folder.
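Because the path has to be given as the server sees it, it can help to derive it from the local path instead of typing it by hand. The sketch below illustrates this under stated assumptions: the helper names as_seen_by_server and load_files_command are hypothetical, the local folder /home/user/extroot is an invented example, and /opt/caosdb/mnt/extroot is the container path from the example above.

```python
import shlex
from pathlib import PurePosixPath


def as_seen_by_server(local_path, local_root, server_root):
    """Translate a local path below local_root into the server's view."""
    # relative_to raises ValueError if local_path is not inside local_root.
    relative = PurePosixPath(local_path).relative_to(local_root)
    return str(PurePosixPath(server_root) / relative)


def load_files_command(server_path, prefix="/"):
    """Build the loadFiles call in the generic syntax shown above."""
    return ["python3", "-m", "caosadvancedtools.loadFiles",
            "-p", prefix, server_path]


server_path = as_seen_by_server(
    "/home/user/extroot/ExperimentalData",  # hypothetical local location
    "/home/user/extroot",                   # hypothetical local mount source
    "/opt/caosdb/mnt/extroot",              # mount point inside the container
)
print(shlex.join(load_files_command(server_path)))
# prints: python3 -m caosadvancedtools.loadFiles -p / /opt/caosdb/mnt/extroot/ExperimentalData
```

Keeping the command as a list of arguments avoids quoting problems if a folder name contains spaces.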

Running the crawler program#

The following command assumes that the extroot folder mounted in the LinkAhead docker container is located in ../extroot:

caosdb-crawler -i identifiables.yml --prefix /extroot --debug --provenance=provenance.yml -s update cfood.yml ../extroot/ExperimentalData/
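The two steps can also be chained from a small script so that the crawler only starts once the files have been inserted successfully. This is a minimal sketch, not an official wrapper: crawler_command and run_workflow are hypothetical helper names, and the only options used are those that appear in the commands on this page. The value given to -s is passed through unchanged.

```python
import subprocess


def crawler_command(cfood, crawled_dir,
                    identifiables="identifiables.yml",
                    prefix="/extroot",
                    provenance="provenance.yml",
                    s_value="update"):
    """Build the caosdb-crawler call with the options used on this page."""
    return [
        "caosdb-crawler",
        "-i", identifiables,
        "--prefix", prefix,
        "--debug",
        f"--provenance={provenance}",
        "-s", s_value,  # the page's example passes "update" here
        cfood,
        crawled_dir,
    ]


def run_workflow(commands, runner=subprocess.run):
    """Run the commands in order and stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later steps do not run after a failed one.
        runner(command, check=True)
```

A possible call, using the Docker example from this page, would be run_workflow([["python3", "-m", "caosadvancedtools.loadFiles", "-p", "/", "/opt/caosdb/mnt/extroot/ExperimentalData"], crawler_command("cfood.yml", "../extroot/ExperimentalData/")]). The runner parameter exists only so that the sequence can be tested without starting real processes.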