Tags: tutorial

CFood-Definition

Note

This page has been migrated from the old documentation and has not yet been fully revised. There might be inconsistencies or errors when using it with current LinkAhead versions.

CFoods specify how data from a file hierarchy is mapped to LinkAhead Records.

In the simplest case, the CFood is just one YAML file with a single document that includes at least a converter tree specification, as shown in Example 1.

If metadata and macro definitions are provided, they must be given in a second document that precedes the converter tree specification. Both documents may be placed in the same YAML file, separated by the --- syntax.

It is highly recommended to specify, in the metadata section, the version of the LinkAhead crawler for which the CFood is written (see below).

For historical reasons, some examples may include the custom converter definition in the converter tree document (see Example 2). This feature is deprecated; the converter definition should instead be included in the metadata and macro document (see below).
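The one-document and two-document layouts described above can be told apart with a few lines of Python. This is only a minimal sketch using the standard library: the helper names `split_yaml_documents` and `describe_cfood` are hypothetical, the split on `---` lines is deliberately naive (a real YAML parser handles more cases), and rejecting other document counts is a choice of this sketch, not a statement about the crawler.

```python
def split_yaml_documents(text: str) -> list[str]:
    """Naively split YAML text at lines that consist only of '---'."""
    documents = []
    current = []
    for line in text.splitlines():
        if line.rstrip() == "---":
            # A separator closes the current document (if it has content).
            if any(part.strip() for part in current):
                documents.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        documents.append("\n".join(current))
    return documents


def describe_cfood(text: str) -> str:
    """Classify a CFood text by the number of YAML documents it contains."""
    count = len(split_yaml_documents(text))
    if count == 1:
        return "converter tree only"
    if count == 2:
        return "metadata document followed by converter tree"
    # Assumption of this sketch: anything else is treated as an error.
    raise ValueError(f"expected one or two YAML documents, found {count}")


SINGLE = """\
extroot:
  type: Directory
  match: ^extroot$
"""

DOUBLE = """\
---
metadata:
  crawler-version: 0.2.1
---
extroot:
  type: Directory
  match: ^extroot$
"""

print(describe_cfood(SINGLE))
print(describe_cfood(DOUBLE))
```

Note that a leading `---` before the first document does not create an extra, empty document in this sketch.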

Examples

1. A single document with a converter tree specification:

extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)

2. A single document with a converter tree specification and a custom converters section:

Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2

extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)

3. A YAML multi-document that defines metadata and some macros in the first document and declares two custom converters in the second document. This syntax is not recommended; the preferred syntax is shown in Example 4.

---
metadata:
  name: Datascience CFood
  description: CFood for data from the local data science work group
  crawler-version: 0.2.1
  macros:
  - !defmacro
    name: SimulationDatasetFile
    params:
      match: null
      recordtype: null
      nodename: null
    definition:
      # (...)
---
Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2

extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)

4. The recommended way of defining metadata, custom converters, macros and the main CFood specification is shown in the following code example:

---
metadata:
  name: Datascience CFood
  description: CFood for data from the local data science work group
  crawler-version: 0.2.1
  macros:
  - !defmacro
    name: SimulationDatasetFile
    params:
      match: null
      recordtype: null
      nodename: null
    definition:
      # (...)
  Converters:
    CustomConverter_1:
      package: mypackage.converters
      converter: CustomConverter1
    CustomConverter_2:
      package: mypackage.converters
      converter: CustomConverter2
---
extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)

List Mode

Property values can start with one of two special characters, + or *, to automatically create lists or multi properties instead of single values:

Experiment1:
    Measurement: +Measurement #  Element in List (list is cleared before run)
                 *Measurement #  Multi Property (properties are removed before run)
                 Measurement  #  Overwrite

Values and units

Property values can be specified as simple strings (as above) or as dictionaries that may also specify the collection mode. Strings starting with a “$” are replaced by the corresponding variable, if one exists. See the tutorials chapter of this documentation for more elaborate examples of how variable replacement works. A simple example could look like the following:

ValueElt:
  type: TextElement
  match_name: ^my_prop$
  match_value: "(?P<value>.*)"  # Anything in here is stored in the variable "value"
  records:
    MyRecord:
      MyProp: $value  # will be replaced by whatever is stored in the "value" variable set above.
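The mechanics of this example (a named regex group fills a variable, and a `$name` placeholder is replaced only if such a variable exists) can be imitated with Python's standard library. This is an analogy under stated assumptions, not the crawler's implementation: `string.Template` merely happens to use a similar `$name` syntax, and the input string is made up for illustration.

```python
import re
from string import Template

# The match_value pattern from the CFood above; the named group "value"
# becomes a variable.
match = re.match(r"(?P<value>.*)", "42 apples")  # hypothetical element content
variables = match.groupdict()

# "$value" is replaced because a variable of that name exists ...
resolved = Template("$value").safe_substitute(variables)

# ... while a placeholder without a matching variable is left untouched.
unresolved = Template("$other").safe_substitute(variables)

print(resolved)
print(unresolved)
```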

If not given explicitly, the collection mode will be determined from the first character of the property value as explained above. This means the following three definitions are all equivalent:

MyProp: +$value

and

MyProp:
  value: +$value

and

MyProp:
  value: $value
  collection_mode: list
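The equivalence of the three forms can be made concrete with a small normalization routine. This is a hedged sketch, not crawler code: `normalize_property` is a hypothetical helper, and only the mode name "list" appears on this page; "multiproperty" and "overwrite" are names assumed here for the other two behaviours from the List Mode section.

```python
# Prefix characters from the List Mode section mapped to collection modes.
# Only "list" is named in the text; the other mode names are assumptions.
PREFIX_MODES = {"+": "list", "*": "multiproperty"}
DEFAULT_MODE = "overwrite"


def normalize_property(definition):
    """Return (collection_mode, value) for a string or mapping definition."""
    if isinstance(definition, dict):
        value = definition["value"]
        mode = definition.get("collection_mode")
    else:
        value, mode = definition, None
    # Without an explicit mode, the first character of the value decides.
    if mode is None and isinstance(value, str) and value[:1] in PREFIX_MODES:
        return PREFIX_MODES[value[0]], value[1:]
    return mode or DEFAULT_MODE, value


forms = [
    "+$value",
    {"value": "+$value"},
    {"value": "$value", "collection_mode": "list"},
]
results = [normalize_property(form) for form in forms]
print(results)
```

All three entries normalize to the same pair, which is what "equivalent" means in the paragraph above.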

Units of numeric values can be set by giving the property value as a mapping with the two keys value and unit, as shown in this example:

ValueWithUnitElt:
  type: TextElement
  match_name: ^my_prop$
  match_value: "^(?P<number>\\d+\\.?\\d*)\\s+(?P<unit>.+)"  # Extract value and unit from a string which
                                                           # has a number followed by at least one whitespace
                                                           # character followed by a unit.
  records:
    MyRecord:
      MyProp:
        value: $number
        unit: $unit
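To see what the match_value pattern above extracts, it can be tried directly with Python's `re` module. Note that the doubled backslashes in the double-quoted YAML string correspond to single backslashes in the regular expression. The sample strings are made up for illustration, and the use of `re.match` here is an assumption of this sketch rather than a statement about how the crawler applies the pattern.

```python
import re

# Same pattern as in the CFood, written as a Python raw string.
PATTERN = re.compile(r"^(?P<number>\d+\.?\d*)\s+(?P<unit>.+)")


def split_value_and_unit(text):
    """Return (number, unit) as strings, or None if the text does not match."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return match.group("number"), match.group("unit")


print(split_value_and_unit("9.81 m/s^2"))  # number and unit separated by a space
print(split_value_and_unit("42kg"))        # no whitespace: the pattern does not match
```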

File Entities

In order to use File Entities, you must set the appropriate role: File. Additionally, the path and file keys have to be given; their values set the path in LinkAhead and the path where the file is found locally, respectively. You can use the variable <converter name>.path, which is automatically created by converters dealing with file system related StructureElements. The File entity itself is stored in a variable named after its key in the records section (fileEntity in the example below), as is the case for other Records.

somefile:
  type: SimpleFile
  match: ^params.*$  # match any file that starts with "params"
  records:
    fileEntity:
      role: File              # necessary to create a File Entity
      path: ${somefile.path}  # defines the path in LinkAhead
      file: ${somefile.path}  # path where the file is found locally
    SomeRecord:
      ParameterFile: $fileEntity  # creates a reference to the file

Transform Functions

You can use transform functions to alter variable values that the crawler consumes (e.g. a string that was matched with a regular expression). For more information, refer to the Converter and Transform Functions tutorials.

You can define your own transform functions by adding them the same way you add custom converters:

Transformers:
  transform_foo:
    package: some.package
    function: some_foo
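On the Python side, the function referenced above is an ordinary function in the named package. The following is a sketch under assumptions: it supposes that the crawler calls a transform function with the incoming value and a dictionary of parameters, and the `suffix` parameter is purely hypothetical. Please check the Transform Functions tutorial for the signature your crawler version expects.

```python
# Hypothetical content of the module referenced by "package: some.package".
from typing import Any


def some_foo(in_value: Any, in_parameters: dict) -> str:
    """Strip surrounding whitespace and optionally drop a suffix.

    Assumption: the function receives the value to transform and a
    dictionary of parameters given in the CFood.
    """
    text = str(in_value).strip()
    suffix = in_parameters.get("suffix", "")  # hypothetical parameter
    if suffix and text.endswith(suffix):
        text = text[: -len(suffix)]
    return text


print(some_foo("  run42.dat ", {"suffix": ".dat"}))
```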

Automatically generated keys

Some variable names are automatically generated and can be used with the $<variable name> syntax. These include:

  • <converter name>: access the path of converter names to the current converter

  • <converter name>.path: defined only for file system related converters; contains the file system path to the structure element. Curly brackets are required to use it: ${<converter name>.path}

  • <Record key>: all entities created in the records section are available under the same key as used in that section
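Why the curly brackets matter for names containing a dot can be illustrated with a small substitution routine. This is a sketch of one plausible replacement scheme, not the crawler's actual code: the regular expression, the helper `replace_variables`, and the values in the store are all assumptions made for illustration.

```python
import re

# "${name}" may contain any characters except "}", while a bare "$name"
# is assumed to end at the first character that is not a letter, digit
# or underscore (so it stops before a dot).
VARIABLE = re.compile(r"\$\{(?P<braced>[^}]+)\}|\$(?P<plain>[A-Za-z0-9_]+)")


def replace_variables(text, store):
    """Replace known variables in text; unknown ones are left unchanged."""
    def lookup(match):
        name = match.group("braced") or match.group("plain")
        return str(store.get(name, match.group(0)))
    return VARIABLE.sub(lookup, text)


# Hypothetical variable store for a converter named "somefile".
store = {
    "somefile": "SOMEFILE_VALUE",
    "somefile.path": "/data/params.yml",
}

print(replace_variables("${somefile.path}", store))  # looks up "somefile.path"
print(replace_variables("$somefile.path", store))    # looks up "somefile", keeps ".path"
```

Under these assumptions, only the form with curly brackets reaches the `.path` variable; the bare form resolves the shorter name and leaves `.path` as literal text.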