---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: []
---

```{tags} tutorial, crawler, advanced-user, administrator
```

# Crawler Tutorial: Parameter File

:::{note}
This page has been migrated from the old documentation and has not yet been fully revised. There
might be inconsistencies or errors when using it with current LinkAhead versions.
:::

% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/83
% TODO: Archive documentation if for old crawler

## Our data

In the "HelloWorld" example, the {term}`Record` which was synchronized with the server was created
manually using the Python client. Now, we want to have a look at how the {term}`Crawler` can be used
to automate the creation of Records from data.

The Crawler needs instructions on what kind of Records it should create given the data that it sees.
This is done using so-called "{term}`CFood`" YAML files.

Let’s once again start with something simple. A common scenario is that we want to insert the
contents of a parameter file. The parameter file may be named `params_2022-02-02.json` and look like
the following:

```{code-block} json
:caption: params_2022-02-02.json

{
  "frequency": 0.5,
  "resolution": 0.01
}
```

This data describes the two known {term}`Properties <Property>` of our Experiment, and the date in
the file name is the date on which it was conducted.
This means the data model could be described in a `model.yml` like this:

```{code-block} yaml
:caption: model.yml

Experiment:
  recommended_properties:
    frequency:
      datatype: DOUBLE
    resolution:
      datatype: DOUBLE
    date:
      datatype: DATETIME
```

We assume that there will be at most one experiment per day, and that we can identify experiments
using only the date, so the `identifiable.yml` is:

```{code-block} yaml
:caption: identifiable.yml

Experiment:
  - date
```

## Getting started with the CFood

CFoods (Crawler configurations) are stored in YAML files. The following section of a `cfood.yml`
tells the Crawler that the key-value pair `frequency: 0.5` shall be used to set the Property
"frequency" of an "Experiment" Record:

```yaml
...
my_frequency:  # just the name of this section
  type: FloatElement  # it is a float value
  match_name: ^frequency$  # regular expression: Match the 'frequency' key from the data json
  match_value: ^(?P<freq_value>.*)$  # regular expression: We match any value of that key
  records:
    Experiment:
      frequency: $freq_value
...
```

The first part of this section defines which kind of data element will be handled. In this example,
this is a key-value pair with the key "frequency" and a float value. We then use this to set the
"frequency" Property.

To explain in some more detail, let's look at what the regular expressions do:

- `^frequency$` ensures that the key is exactly "frequency". "^" matches the beginning of the string
  and "\$" matches the end.
- `^(?P<freq_value>.*)$` creates a *named match group* with the name "freq_value". The pattern
  within this group is ".*": The dot matches any character and the star indicates that the preceding
  element can occur any number of times, which means that this expression matches any string and
  assigns it to the group with the name `freq_value`.

We can then use the value assigned to a group as a variable. In the above example, we use
`frequency: $freq_value` to assign the extracted frequency value to the frequency Property of our
new Experiment.
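
The behaviour of the two regular expressions can be tried out with Python's standard `re` module.
The following is only a minimal sketch of the matching idea, not the Crawler's actual
implementation: the helper `extract_frequency` is hypothetical, and converting the value to a string
before matching is an assumption made for this illustration.

```python
import re

# The two patterns from the CFood section above; the group name
# "freq_value" is the one mentioned in the explanation.
NAME_PATTERN = re.compile(r"^frequency$")
VALUE_PATTERN = re.compile(r"^(?P<freq_value>.*)$")


def extract_frequency(key, value):
    """Hypothetical helper: return the captured value as a string if the
    key is exactly 'frequency', otherwise None."""
    if NAME_PATTERN.match(key) is None:
        return None  # the key does not match, so nothing is extracted
    # Assumption for this sketch: the value is matched as a string.
    match = VALUE_PATTERN.match(str(value))
    if match is None:
        return None
    return match.group("freq_value")
```

Calling `extract_frequency("frequency", 0.5)` captures the text of the value in the group
`freq_value`, whereas a key such as `"frequency_max"` is rejected because of the `^` and `$` anchors.
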
:::{note}
For more information on the ``cfood.yml`` specification, read on in the chapter [CFoods](./cfood).
:::

## A fully grown CFood

To show how the section extracting the experiment's frequency fits into a complete CFood that
creates the full Experiment Record, here is what the `cfood.yml` for this example might look like:

```{code-block} yaml
:caption: cfood.yml

---
metadata:
  crawler-version: 0.5.0
---
directory:  # corresponds to the directory given to the crawler
  type: Directory
  match: .*  # we do not care how it is named here
  subtree:
    parameterfile:  # corresponds to our parameter file
      type: JSONFile
      match: params_(?P<date>\d+-\d+-\d+)\.json  # extract the date from the parameter file
      records:
        Experiment:  # one Experiment is associated with the file
          date: $date  # the date is taken from the file name
      subtree:
        dict:  # the JSON contains a dictionary
          type: Dict
          match: .*  # the dictionary does not have a meaningful name
          subtree:
            my_frequency:  # here we parse the frequency...
              type: FloatElement
              match_name: frequency
              match_value: (?P<val>.*)
              records:
                Experiment:
                  frequency: $val
            resolution:  # ... and here the resolution
              type: FloatElement
              match_name: resolution
              match_value: (?P<val>.*)
              records:
                Experiment:
                  resolution: $val
```

You do not need to understand every aspect of this definition; a detailed tutorial on creating a
full CFood follows in the [next section](./cfood.md). For now, we want to see it running!

The crawler can now be run with the following command (assuming that the CFood file is in the
current working directory):

```sh
caosdb-crawler -s update -i identifiable.yml cfood.yml .
```

:::{note}
`caosdb-crawler` currently only works with CFoods that have a directory as top-level element.
:::
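
To make the matching rules of the full CFood more tangible, the same extraction logic can be
written out in plain Python. This is only a rough sketch of what the CFood describes, not how the
Crawler works internally: the function `experiments_from_directory` is hypothetical, the use of
`fullmatch` on the file name is an assumption, and no Records are created or synchronized with a
server here.

```python
import json
import re
from pathlib import Path

# Same pattern as in the CFood; the group name "date" mirrors the
# $date variable used there.
FILE_PATTERN = re.compile(r"params_(?P<date>\d+-\d+-\d+)\.json")


def experiments_from_directory(directory):
    """Hypothetical helper: build one plain dict per parameter file,
    holding the values the CFood would assign to an Experiment."""
    experiments = []
    for path in sorted(Path(directory).iterdir()):
        match = FILE_PATTERN.fullmatch(path.name)
        if match is None:
            continue  # not a parameter file, so it is skipped
        content = json.loads(path.read_text())
        experiments.append(
            {
                "date": match.group("date"),  # taken from the file name
                "frequency": content.get("frequency"),
                "resolution": content.get("resolution"),
            }
        )
    return experiments
```

Running `experiments_from_directory(".")` in a directory that contains `params_2022-02-02.json`
yields one entry with the date from the file name and the two values from the JSON content.
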