Tags: tutorial, crawler, advanced-user, administrator

Crawler Tutorial: Parameter File

Note

This page has been migrated from the old documentation and has not yet been fully revised. There may be inconsistencies or errors when it is used with current LinkAhead versions.

Our data

In the “HelloWorld” Example, the Record which was synchronized with the server was created manually using the Python client. Now, we want to have a look at how the Crawler can be used to automate the creation of Records from data.

The Crawler needs instructions on what kind of Records it should create from the data that it sees. These instructions are given in so-called “CFood” YAML files.

Let’s once again start with something simple. A common scenario is that we want to insert the contents of a parameter file. The parameter file may be named params_2022-02-02.json and look like the following:

params_2022-02-02.json

{
  "frequency": 0.5,
  "resolution": 0.01
}

This data describes the two known Properties of our Experiment, and the date in the file name is the date on which the experiment was conducted. This means the data model could be described in a model.yml like this:

model.yml

Experiment:
  recommended_properties:
    frequency:
      datatype: DOUBLE
    resolution:
      datatype: DOUBLE
    date:
      datatype: DATETIME

We assume that there will be at most one experiment per day, and that we can identify experiments using only the date, so the identifiable.yml is:

identifiable.yml

Experiment:
  - date
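
To illustrate what identifying an Experiment by its date means in practice, here is a minimal Python sketch. It is not LinkAhead's API: the in-memory `store` and the functions `identity_key` and `synchronize` are hypothetical stand-ins for the server-side lookup. A record whose identifying properties match an existing one is updated rather than inserted a second time.

```python
# Hypothetical in-memory stand-in for the server; not LinkAhead's API.
IDENTIFYING_PROPERTIES = {"Experiment": ["date"]}  # mirrors identifiable.yml


def identity_key(record_type, properties):
    """Build a lookup key from the identifying properties only."""
    names = IDENTIFYING_PROPERTIES[record_type]
    return (record_type, tuple(properties[name] for name in names))


def synchronize(store, record_type, properties):
    """Insert the record if its key is new, otherwise update the existing one."""
    key = identity_key(record_type, properties)
    action = "update" if key in store else "insert"
    store.setdefault(key, {}).update(properties)
    return action


store = {}
first = synchronize(store, "Experiment", {"date": "2022-02-02", "frequency": 0.5})
second = synchronize(store, "Experiment", {"date": "2022-02-02", "resolution": 0.01})
print(first, second, len(store))
```

Both calls carry the same date, so the second one is treated as an update of the same Experiment and the store still holds a single record.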

Getting started with the CFood

CFoods (Crawler configurations) can be stored in YAML files. The following section in a cfood.yml tells the Crawler that the key-value pair frequency: 0.5 shall be used to set the Property “frequency” of an “Experiment” Record:

...
my_frequency:  # just the name of this section
  type: FloatElement  # it is a float value
  match_name: ^frequency$  # regular expression: Match the 'frequency' key from the data json
  match_value: ^(?P<freq_value>.*)$  # regular expression: We match any value of that key
  records:
    Experiment:
      frequency: $freq_value
...

The first part of this section defines which kind of data element will be handled. In this example, this is a key-value pair with the key “frequency” and a float value. We then use this to set the “frequency” Property.

To explain in some more detail, let’s look at what the regular expressions do:

^frequency$ ensures that the key is exactly “frequency”. “^” matches the beginning of the string and “$” matches the end.
^(?P<freq_value>.*)$ creates a named match group with the name “freq_value”. The pattern within this group is “.*”: The dot matches any character and the star indicates that the preceding character can occur any number of times, which means that this expression matches any string and assigns it to the group with the name freq_value.

We can then use the values assigned to a group as a variable. In the above example, we use frequency: $freq_value to assign the extracted frequency value to the frequency Property of our new Experiment.
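
The two patterns can be tried out directly with Python's `re` module, which uses the same named-group syntax. This is only a sketch of the matching step, not the Crawler's own code, and it assumes that the float value is compared in its text form (`"0.5"`).

```python
import re

# The two patterns from the CFood section above.
name_pattern = re.compile(r"^frequency$")
value_pattern = re.compile(r"^(?P<freq_value>.*)$")

# Assumption: the value is matched as text, so 0.5 becomes "0.5".
key, value = "frequency", str(0.5)

name_match = name_pattern.search(key)
value_match = value_pattern.search(value)

# Keys that merely contain "frequency" are rejected by the anchors.
rejected = [name_pattern.search(k) for k in ("base_frequency", "frequency_hz")]

# The named group makes the matched text available under "freq_value".
freq_value = value_match.group("freq_value")
print(freq_value)
```

The content of the group `freq_value` is what the CFood refers to as `$freq_value`.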

Note

For more information on the cfood.yml specification, read on in the chapter CFoods.

A fully grown CFood

To show how this section, which extracts the experiment’s frequency, fits into a complete CFood that creates the full Experiment Record, the CFood file cfood.yml for this example might look like the following:

cfood.yml

---
metadata:
  crawler-version: 0.5.0
---
directory: # corresponds to the directory given to the crawler
  type: Directory
  match: .* # we do not care how it is named here
  subtree:
    parameterfile:  # corresponds to our parameter file
      type: JSONFile
      match: params_(?P<date>\d+-\d+-\d+)\.json # extract the date from the parameter file
      records:
        Experiment: # one Experiment is associated with the file
          date: $date # the date is taken from the file name
      subtree:
        dict:  # the JSON contains a dictionary
          type: Dict
          match: .* # the dictionary does not have a meaningful name
          subtree:
            my_frequency: # here we parse the frequency...
              type: FloatElement
              match_name: frequency
              match_value: (?P<val>.*)
              records:
                Experiment:
                  frequency: $val
            resolution: # ... and here the resolution
              type: FloatElement
              match_name: resolution
              match_value: (?P<val>.*)
              records:
                Experiment:
                  resolution: $val
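
As a rough mental model of the tree above, the following plain-Python sketch walks through the same steps by hand: match the file name, open the JSON dictionary, and copy matching float values into one Experiment. It is an illustration under simplifying assumptions, not how the Crawler is implemented. The `experiment` dict and the `element_rules` list are hypothetical stand-ins for the Record and for the two FloatElement nodes.

```python
import json
import re

# Inputs from the tutorial: the file name and the file content.
file_name = "params_2022-02-02.json"
file_content = '{"frequency": 0.5, "resolution": 0.01}'

experiment = {}  # plain dict standing in for the Experiment Record

# "parameterfile" node: match the file name and extract the date.
file_match = re.match(r"params_(?P<date>\d+-\d+-\d+)\.json", file_name)
if file_match:
    experiment["date"] = file_match.group("date")

    # "dict" node: the JSON content is a dictionary of key-value pairs.
    data = json.loads(file_content)

    # "my_frequency" and "resolution" nodes: (match_name, Property) pairs.
    element_rules = [("frequency", "frequency"), ("resolution", "resolution")]
    for key, value in data.items():
        for match_name, property_name in element_rules:
            # Only float values whose key matches the name pattern are used.
            if isinstance(value, float) and re.match(match_name, key):
                val = re.match(r"(?P<val>.*)", str(value)).group("val")
                experiment[property_name] = float(val)

print(experiment)
```

Each level of nesting in the CFood corresponds to one step down into the data: directory, file, dictionary, and finally the individual key-value pairs.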

You do not need to understand every aspect of this definition yet; a detailed tutorial on creating a full CFood follows in the next section. For now, we want to see it running!

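The crawler can now be run with the following command. It assumes that the CFood file, the identifiable definition and the parameter file are all in the current working directory, which is passed to the crawler as the final argument (.):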

caosdb-crawler -s update -i identifiable.yml cfood.yml .

Note

caosdb-crawler currently only works with CFoods which have a directory as their top-level element.