---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: []
---

```{tags} tutorial, crawler, administrator, advanced-user
```

# Setting up a crawler workflow

:::{note}
This page has been migrated from the old documentation and has not yet been fully revised. There might be inconsistencies or errors when using it with current LinkAhead versions.
:::

% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/84
% TODO: Replace links with the new doc links
% TODO: This needs context. Probably should be merged into a larger 'set up your own crawler'
% TODO: tutorial

The LinkAhead {term}`crawler ` provides a flexible framework for synchronizing data on file systems (or potentially other sources of information) with a running LinkAhead instance. The workflow used in a scientific environment should be chosen according to the users' needs. It is also possible to combine multiple workflows or to use them in parallel.

This document describes several workflows for crawler operation.

## Local Crawler Operation

A very simple setup, which can also reliably be used for testing, runs the crawler on a local computer. The files being crawled need to be visible to both the locally running crawler and the LinkAhead server.

### Prerequisites

- Make sure that LinkAhead is running, that your computer has a network connection to LinkAhead, and that your `pycaosdb.ini` points to the correct instance of LinkAhead. Please refer to the [pylib manual](/how_to/dev_guides/pylib/README_SETUP_pylib.md) for questions related to the configuration in `pycaosdb.ini`.
- Make sure that caosdb-crawler and caosdb-advanced-user-tools are installed (e.g. using pip).
- Make sure that you have created:
  - The data model needed for the crawler.
  - A file `identifiables.yml` describing the identifiables.
  - A cfood file, e.g. `cfood.yml`.

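
Part of the prerequisites above can be checked with a small shell sketch before starting. The function names `check_tool` and `check_crawler_files` are illustrative and not part of LinkAhead. The file names follow the examples in the list, and the sketch assumes that the installed crawler package provides a `caosdb-crawler` command on `PATH`, as the crawler invocation in this tutorial suggests. It does not verify the data model or the connection to LinkAhead.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and default file names are assumptions
# taken from the examples in this tutorial, not a LinkAhead API.

# Report whether a command-line tool is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "not found on PATH: $1" >&2
    return 1
}

# Report which of the crawler input files are missing in a directory
# (defaults to the current directory).
check_crawler_files() {
    local dir="${1:-.}"
    local missing=0
    local name
    for name in identifiables.yml cfood.yml; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A possible call from the directory that holds the crawler definitions would be `check_tool caosdb-crawler && check_crawler_files .`; both functions return a non-zero status if something is missing.
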
### Running the crawler

Running the crawler currently involves two steps:

- Inserting the files
- Running the crawler program

#### Inserting the files

This can be done using the module `loadFiles` from caosadvancedtools; see the [advancedtools section](/how_to/dev_guides/advanced_user_tools/README_SETUP_aut.md) for an installation guide.

The generic syntax is:

```bash
python3 -m caosadvancedtools.loadFiles -p \ \
```

:::{note}
The path argument is the location of the files as seen by LinkAhead. This means that for a LinkAhead instance running in a docker container, the command might be:

```bash
python3 -m caosadvancedtools.loadFiles -p / /opt/caosdb/mnt/extroot/ExperimentalData
```

This command would load the folder `ExperimentalData` contained in the folder `extroot` within the docker container, and copy it to the LinkAhead root folder.
:::

#### Running the crawler

The following command assumes that the `extroot` folder mounted in the LinkAhead docker container is located in `../extroot`:

```bash
caosdb-crawler -i identifiables.yml --prefix /extroot --debug --provenance=provenance.yml -s update cfood.yml ../extroot/ExperimentalData/
```
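
The two steps can be chained in a small wrapper. This is a minimal sketch under the assumptions of the docker example in this section: the `-p /` and `--prefix /extroot` values and all other flags are copied from the commands shown there, while the helper names `run` and `crawl_folder` and the `DRY_RUN` variable are hypothetical and only serve the illustration. The same folder is passed twice because the server and the local crawler see it under different paths.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are illustrative assumptions.

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: data folder as seen by the LinkAhead server (inside the container)
# $2: the same data folder as seen by the locally running crawler
crawl_folder() {
    local server_path="$1"
    local local_path="$2"
    # Step 1: insert the files; stop if this fails.
    run python3 -m caosadvancedtools.loadFiles -p / "$server_path" || return 1
    # Step 2: run the crawler on the local view of the same folder.
    run caosdb-crawler -i identifiables.yml --prefix /extroot --debug \
        --provenance=provenance.yml -s update cfood.yml "$local_path"
}
```

Setting `DRY_RUN=1` before calling `crawl_folder /opt/caosdb/mnt/extroot/ExperimentalData ../extroot/ExperimentalData/` prints the two commands without contacting LinkAhead, which can help to check the paths first.
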