---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: []
---

```{tags} tutorial, crawler, administrator, advanced-user
```

# Setting up a crawler workflow

:::{note}
This page has been migrated from the old documentation, and has not yet been fully revised.
There might be inconsistencies or errors when using with current LinkAhead versions.
:::
% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/84
% TODO: Replace links with the new doc links
% TODO: This needs context. Probably should be merged into a larger 'set up your own crawler'
% TODO: tutorial

The LinkAhead {term}`crawler <Crawler>` provides a flexible framework for synchronizing
data on file systems (or potentially other sources of information) with a running LinkAhead
instance. The workflow used in a scientific environment should be chosen according to the
users' needs. It is also possible to combine multiple workflows or use them in parallel.

In this document we will describe several workflows for crawler operation.

## Local Crawler Operation

A very simple setup, which can also be used reliably for testing, runs the crawler on a local
computer. The files being crawled need to be visible to both the locally running crawler and the
LinkAhead server.

### Prerequisites

- Make sure that LinkAhead is running, that your computer has a network connection to LinkAhead, and
  that your `pycaosdb.ini` points to the correct LinkAhead instance. Please refer to the
  [pylib manual](/how_to/dev_guides/pylib/README_SETUP_pylib.md) for questions about the
  configuration in `pycaosdb.ini`.
- Make sure that caosdb-crawler and caosdb-advanced-user-tools are installed (e.g. using pip).
- Make sure that you have created:
    - The data model needed by the crawler.
    - A file `identifiables.yml` describing the identifiables.
    - A cfood file, e.g. `cfood.yml`.
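
The file-related prerequisites above can be checked with a small shell sketch before starting a
crawl. The function names `check_crawler_inputs` and `check_crawler_command` are hypothetical
helpers, not part of any LinkAhead package, and the file names are the examples used on this page.
The sketch does not verify the data model or the connection configured in `pycaosdb.ini`.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; function names are illustrative only.

# Report every missing input file in the given directory (default: current one).
# Returns 0 if both files exist, 1 otherwise.
check_crawler_inputs() {
    local dir="${1:-.}"
    local missing=0
    local f
    for f in identifiables.yml cfood.yml; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Check that the crawler executable used later on this page is on the PATH.
check_crawler_command() {
    command -v caosdb-crawler >/dev/null 2>&1
}
```

A possible call is `check_crawler_inputs . && check_crawler_command && echo "ready to crawl"`.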

### Running the crawler

Running the crawler currently involves two steps:

- Inserting the files
- Running the crawler program

#### Inserting the files

This can be done using the `loadFiles` module from `caosadvancedtools`. See the
[advancedtools section](/how_to/dev_guides/advanced_user_tools/README_SETUP_aut.md) for an
installation guide.

The generic syntax is:

```bash
python3 -m caosadvancedtools.loadFiles -p <prefix-in-caosdb-file-system> <path-to-crawled-folder>
```

:::{note}
The `<path-to-crawled-folder>` is the location of the files as seen by LinkAhead. This means that for
a LinkAhead instance running in a Docker container, the command might be:

```bash
python3 -m caosadvancedtools.loadFiles -p / /opt/caosdb/mnt/extroot/ExperimentalData
```

This command would load the folder `ExperimentalData` contained in the folder `extroot` within the
Docker container, and copy it to the LinkAhead root folder.
:::
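
Because the same folder has one path on the local machine and another inside the container, it can
help to derive the server-side path from the local one. The following is a minimal sketch:
`to_container_path` is a hypothetical helper, and both default locations (`../extroot` locally,
`/opt/caosdb/mnt/extroot` in the container) are assumptions taken from the examples on this page.
Adjust them to your own setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a path below the local extroot folder into
# the path that LinkAhead sees inside the container.
# Both defaults are assumptions taken from the examples on this page.
HOST_EXTROOT="${HOST_EXTROOT:-../extroot}"
CONTAINER_EXTROOT="${CONTAINER_EXTROOT:-/opt/caosdb/mnt/extroot}"

to_container_path() {
    local host_path="${1%/}"        # drop one trailing slash, if any
    local root="${HOST_EXTROOT%/}"
    case "$host_path" in
        "$root")   echo "$CONTAINER_EXTROOT" ;;
        "$root"/*) echo "$CONTAINER_EXTROOT/${host_path#"$root"/}" ;;
        *)         echo "not below $root: $1" >&2; return 1 ;;
    esac
}
```

With the defaults, `to_container_path ../extroot/ExperimentalData/` prints
`/opt/caosdb/mnt/extroot/ExperimentalData`, which is the path passed to `loadFiles` above.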

#### Running the crawler program

The following command assumes that the `extroot` folder mounted in the LinkAhead Docker container is
located in `../extroot` on the local machine:

```bash
caosdb-crawler -i identifiables.yml --prefix /extroot --debug --provenance=provenance.yml \
    -s update cfood.yml ../extroot/ExperimentalData/
```
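
The two steps can be chained in a small wrapper, sketched below. The functions `run` and `crawl`
and the `DRY_RUN` variable are hypothetical and only illustrate the order of the commands. All
command-line options are copied unchanged from the examples above. With `DRY_RUN=1` the commands
are printed instead of executed, which makes it possible to inspect them before touching a
LinkAhead instance.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the two steps; names are illustrative only.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: folder as seen by LinkAhead, $2: the same folder as seen locally.
crawl() {
    local server_path="$1"
    local local_path="$2"
    # Step 1: insert the files. Stop here if this step fails.
    run python3 -m caosadvancedtools.loadFiles -p / "$server_path" || return 1
    # Step 2: run the crawler program on the local view of the folder.
    run caosdb-crawler -i identifiables.yml --prefix /extroot --debug \
        --provenance=provenance.yml -s update cfood.yml "$local_path"
}
```

A dry run could look like
`DRY_RUN=1 crawl /opt/caosdb/mnt/extroot/ExperimentalData ../extroot/ExperimentalData/`.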