Crawler Tutorial: Parameter File
Note
This page has been migrated from the old documentation and has not yet been fully revised. There might be inconsistencies or errors when using it with current LinkAhead versions.
Our data
In the “HelloWorld” Example, the Record which was synchronized with the server was created manually using the Python client. Now, we want to have a look at how the Crawler can be used to automate the creation of Records from data.
The Crawler needs instructions on what kind of Records it should create from the data that it sees. These instructions are given in so-called “CFood” YAML files.
Let’s once again start with something simple. A common scenario is that we want to insert the
contents of a parameter file. The parameter file may be named params_2022-02-02.json and look like
the following:
{
  "frequency": 0.5,
  "resolution": 0.01
}
This data describes the two known Properties of our Experiment, and the date in
the file name is the date it was conducted. This means the data model could be described in a
model.yml like this:
Experiment:
  recommended_properties:
    frequency:
      datatype: DOUBLE
    resolution:
      datatype: DOUBLE
    date:
      datatype: DATETIME
We assume that there will be at most one experiment per day, and that we can identify experiments
using only the date, so the identifiables.yml is:
Experiment:
- date
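To make the role of the identifiable more concrete, here is a minimal Python sketch. It is not LinkAhead code: the helper `identity` and the plain dictionaries are hypothetical stand-ins, used only to illustrate that two data sets with the same date are treated as the same Experiment, whatever their other values are.

```python
# Identifying properties per RecordType, mirroring the identifiables file above.
IDENTIFIABLES = {"Experiment": ["date"]}


def identity(record_type, properties):
    """Build a comparison key from the identifying properties only."""
    names = IDENTIFIABLES[record_type]
    return (record_type,) + tuple(properties[name] for name in names)


first = {"date": "2022-02-02", "frequency": 0.5, "resolution": 0.01}
rerun = {"date": "2022-02-02", "frequency": 0.7, "resolution": 0.01}
next_day = {"date": "2022-02-03", "frequency": 0.5, "resolution": 0.01}

# Same date: both describe the same Experiment, although the frequency differs.
print(identity("Experiment", first) == identity("Experiment", rerun))
# Different date: a different Experiment.
print(identity("Experiment", first) == identity("Experiment", next_day))
```

In this picture, a second parameter file for the same day would refer to an already known Experiment rather than describe a new one.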
Getting started with the CFood
CFoods (Crawler configurations) can be stored in YAML files. The following section in a cfood.yml
tells the Crawler that the key-value pair frequency: 0.5 shall be used to set the Property
“frequency” of an “Experiment” Record:
...
my_frequency: # just the name of this section
  type: FloatElement # it is a float value
  match_name: ^frequency$ # regular expression: Match the 'frequency' key from the data json
  match_value: ^(?P<freq_value>.*)$ # regular expression: We match any value of that key
  records:
    Experiment:
      frequency: $freq_value
...
The first part of this section defines which kind of data element will be handled. In this example, this is a key-value pair with the key “frequency” and a float value. We then use this to set the “frequency” Property.
To explain in some more detail, let’s look at what the regular expressions do:
^frequency$ ensures that the key is exactly “frequency”. “^” matches the beginning of the string and “$” matches the end.
^(?P<freq_value>.*)$ creates a named match group with the name “freq_value”. The pattern within this group is “.*”: The dot matches any character and the star indicates that the preceding element can occur any number of times (including zero), which means that this expression matches any string and assigns it to the group with the name freq_value.
We can then use the values assigned to a group as a variable. In the above example, we use
frequency: $freq_value to assign the extracted frequency value to the frequency Property of our
new Experiment.
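The two regular expressions can be tried out with Python's standard `re` module. The following is only a sketch of the matching logic, not the Crawler's implementation: the value is written as a string for simplicity, and `string.Template` stands in for the Crawler's own variable substitution.

```python
import re
from string import Template

key, value = "frequency", "0.5"

# match_name: the key must be exactly "frequency"
name_match = re.match(r"^frequency$", key)
# match_value: capture the whole value in the named group "freq_value"
value_match = re.match(r"^(?P<freq_value>.*)$", value)

# Named groups become variables that can be referenced as $freq_value
variables = value_match.groupdict()
assignment = Template("frequency: $freq_value").substitute(variables)

print(name_match is not None)
# A longer key is rejected because "$" anchors the end of the string
print(re.match(r"^frequency$", "frequency_2") is None)
print(variables)
print(assignment)
```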
Note
For more information on the cfood.yml specification, read on in the chapter CFoods.
A fully grown CFood
To show how the section that extracts the experiment's frequency fits into a complete CFood
that creates the full Experiment Record, here is what the CFood file cfood.yml for this
example might look like:
---
metadata:
  crawler-version: 0.5.0
---
directory: # corresponds to the directory given to the crawler
  type: Directory
  match: .* # we do not care how it is named here
  subtree:
    parameterfile: # corresponds to our parameter file
      type: JSONFile
      match: params_(?P<date>\d+-\d+-\d+)\.json # extract the date from the parameter file
      records:
        Experiment: # one Experiment is associated with the file
          date: $date # the date is taken from the file name
      subtree:
        dict: # the JSON contains a dictionary
          type: Dict
          match: .* # the dictionary does not have a meaningful name
          subtree:
            my_frequency: # here we parse the frequency...
              type: FloatElement
              match_name: frequency
              match_value: (?P<val>.*)
              records:
                Experiment:
                  frequency: $val
            resolution: # ... and here the resolution
              type: FloatElement
              match_name: resolution
              match_value: (?P<val>.*)
              records:
                Experiment:
                  resolution: $val
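As a rough mental model, the nesting of this CFood (directory, parameter file, dictionary, float elements) can be read like the plain Python sketch below. This is an illustration only and not how the Crawler is implemented: `collect_experiments`, `FILE_PATTERN` and `FLOAT_KEYS` are hypothetical names, the result is a list of dictionaries rather than LinkAhead Records, nothing is sent to a server, and the use of `fullmatch` is an assumption about how strictly the file name has to match.

```python
import json
import re
import tempfile
from pathlib import Path

# Same pattern as the "match" of the parameterfile element above.
FILE_PATTERN = re.compile(r"params_(?P<date>\d+-\d+-\d+)\.json")
# Keys handled by the two FloatElement sections.
FLOAT_KEYS = ("frequency", "resolution")


def collect_experiments(directory):
    """Return one dict of Experiment properties per matching parameter file."""
    experiments = []
    for path in sorted(Path(directory).iterdir()):       # "directory" level
        file_match = FILE_PATTERN.fullmatch(path.name)   # "parameterfile" level
        if file_match is None:
            continue                                     # other files are ignored
        experiment = {"date": file_match.group("date")}  # date: $date
        content = json.loads(path.read_text())           # "dict" level
        for key in FLOAT_KEYS:                           # "my_frequency" / "resolution"
            if isinstance(content.get(key), float):
                experiment[key] = content[key]
        experiments.append(experiment)
    return experiments


# Try it on a temporary directory that contains the example parameter file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "params_2022-02-02.json").write_text('{"frequency": 0.5, "resolution": 0.01}')
    Path(tmp, "notes.txt").write_text("ignored, the name does not match")
    result = collect_experiments(tmp)

print(result)
```

Each level of the loop corresponds to one level of `subtree` in the CFood, and each assignment to `experiment` corresponds to one entry under `records`.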
You do not need to understand every aspect of this CFood definition; a detailed tutorial on creating a full CFood follows in the next section. For now, we want to see it running!
The crawler can now be run with the following command (assuming that the CFood file is in the current working directory):
caosdb-crawler -s update -i identifiables.yml cfood.yml .
Note
caosdb-crawler currently only works with CFoods which have a directory as top-level element.