---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: []
---

```{tags} tutorial
```

# Standard Converters

:::{note}
This page has been migrated from the old documentation, and has not yet been fully revised. There
might be inconsistencies or errors when using with current LinkAhead versions.
:::

% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/83

% TODO: Much of this information can probably be moved to the docstrings.

These are the standard converters that exist in a default installation. For writing and applying
*custom converters*, see [their documentation](custom_converters.md).

## Directory Converter

The Directory Converter creates {term}`StructureElements` for each File and Directory inside the
current Directory. You can match a regular expression against the directory name using the 'match'
key.

With the optional `match_newer_than_file` key, a path to a file containing only an ISO-formatted
datetime string can be specified. If this is done, a directory will only match if it contains at
least one file or directory that has been modified since that datetime. If the file doesn't exist
or contains an invalid string, the directory will be matched regardless of the modification times.

## Simple File Converter

The Simple File Converter does not create any children and is usually used if a file shall be used
as it is and be inserted and referenced by other {term}`entities`.

## Markdown File Converter

Reads a YAML header from Markdown files (if such a header exists) and creates children elements
according to the structure of the header.

## DictElement Converter

DictElement → StructureElement

Creates a child StructureElement for each key in the dictionary.
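
As a rough illustration of "one child per key", the following sketch maps each entry of a plain
Python dictionary to the name of a StructureElement type by looking at the Python type of the
value. The helper names `child_element_type` and `children_of` are hypothetical, and this is not
the caoscrawler implementation; it only mirrors the idea described above.

```python
def child_element_type(value):
    """Return the name of the element type a value would map to (illustration only)."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "BooleanElement"
    if isinstance(value, int):
        return "IntegerElement"
    if isinstance(value, float):
        return "FloatElement"
    if isinstance(value, str):
        return "TextElement"
    if isinstance(value, list):
        return "ListElement"
    if isinstance(value, dict):
        return "DictElement"
    raise TypeError(f"No element type known for {type(value).__name__}")


def children_of(dictionary):
    """Create one entry (key -> element type name) per dictionary key."""
    return {key: child_element_type(value) for key, value in dictionary.items()}


# Hypothetical example data, not taken from a real crawler run.
example = {
    "valid": True,
    "count": 3,
    "ratio": 0.5,
    "date": "2025-01-01",
    "tags": ["a", "b"],
    "author": {"full_name": "Silvia Scientist"},
}
print(children_of(example))
```
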
The following StructureElement types are typically created by the DictElement converter:

- BooleanElement
- FloatElement
- TextElement
- IntegerElement
- ListElement
- DictElement

Note that you may use `TextElement` for anything that exists in a text format that can be
interpreted by the server, such as date and datetime strings in ISO-8601 format.

### match_properties

`match_properties` is a dictionary of key-regexp and value-regexp pairs and can be used to match
direct {term}`properties` of a `DictElement`. Each key matches a property name and the
corresponding value matches its property value.

Example:

```json
{
  "@type": "PropertyValue",
  "additionalType": "str",
  "propertyID": "testextra",
  "value": "hi"
}
```

When applied to a dict loaded from the above JSON, a `DictElementConverter` with the following
definition:

```yaml
Example:
  type: DictElement
  match_properties:
    additionalType: (?P<addt>.*)$
    property(.*): (?P<propid>.*)$
```

will match and create two variables:

- `addt = "str"`
- `propid = "testextra"`

## Scalar Value Converters

`BooleanElementConverter`, `FloatElementConverter`, `TextElementConverter`, and
`IntegerElementConverter` behave very similarly.

These converters expect `match_name` and `match_value` in their definition, which allow matching
the key and the value, respectively.

Note that there are defaults for accepting other types. For example, FloatElementConverter also
accepts IntegerElements. The default behavior can be adjusted with the fields `accept_text`,
`accept_int`, `accept_float`, and `accept_bool`.
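
The interplay of `match_name`, `match_value` and the `accept_*` fields can be sketched in plain
Python. The functions `kind_of` and `match_scalar` and the group name `val` are hypothetical; the
sketch assumes that both regular expressions are applied from the start of the string and that the
value is converted to a string before matching, which may differ in detail from what caoscrawler
does.

```python
import re


def kind_of(value):
    """Classify a scalar value as 'bool', 'int', 'float' or 'text'."""
    if isinstance(value, bool):  # bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    raise TypeError(f"Not a scalar value: {value!r}")


def match_scalar(key, value, match_name, match_value, accept):
    """Return the variables captured by named groups, or None if there is no match."""
    if kind_of(value) not in accept:
        return None  # this kind of element is not accepted at all
    name_match = re.match(match_name, key)
    value_match = re.match(match_value, str(value))
    if name_match is None or value_match is None:
        return None
    variables = name_match.groupdict()
    variables.update(value_match.groupdict())
    return variables


# A float-like converter that accepts floats and integers ...
print(match_scalar("measurement", 5, r"measurement$", r"(?P<val>.*)", {"float", "int"}))
# ... and the same definition with integers switched off, as `accept_int: false` might do.
print(match_scalar("measurement", 5, r"measurement$", r"(?P<val>.*)", {"float"}))
```

The first call returns the captured variable as a string, the second returns `None` because the
integer value is no longer accepted.
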
The following denotes what kind of StructureElements are accepted by default (they are defined in
`src/caoscrawler/converters.py`):

- BooleanElementConverter: bool, int
- FloatElementConverter: int, float
- TextElementConverter: text, bool, int, float
- IntegerElementConverter: int
- ListElementConverter: list
- DictElementConverter: dict

## YAMLFileConverter

A specialized converter for yaml files: Yaml files are opened and the contents are converted into
dictionaries, which then can be further converted using the typical subtree converters of
DictElementConverter.

% TODO: ## JSONFileConverter

## TableConverter

Table → DictElement

A generic converter (abstract) for files containing tables. Currently, there are two specialized
implementations for XLSX files and CSV files.

All table converters generate a subtree of dicts, which in turn can be converted with
DictElementConverters: For each row in the table the TableConverter generates a DictElement
(structure element). The key of the element is the row number. The value of the element is a dict
containing the mapping of column names to values of the respective cell.

Example:

```yaml
subtree:
  TABLE:  # Any name for the table as a whole
    type: CSVTableConverter
    match: ^test_table.csv$
    records:
      (...)  # Records edited for the whole table file
    subtree:
      ROW:  # Any name for a data row in the table
        type: DictElement
        match_name: .*
        match_value: .*
        records:
          (...)  # Records edited for each row
        subtree:
          COLUMN:  # Any name for a specific type of column in the table
            type: FloatElement
            match_name: measurement  # Name of the column in the table file
            match_value: (?P
```

`MyType1` and `MyType2`.
It has a scalar property `a` with value 5, a list property `b` with values "a", "b" and "c", and an
`author` property which references an `author` with a `full_name` property with the value "Silvia
Scientist":

:::{figure} /.assets/images/tutorials/crawler/properties-from-dict-records-author.png
:alt: |-
:    A Record "New Name" and an author Record with full_name
:    "Silvia Scientist" are generated and filled automatically.
:height: 210
:::

Note how the different dictionary keys are handled differently depending on their types: scalar and
list values are understood automatically, and a dictionary-valued entry like `author` is translated
into a reference to an `author` Record automatically.

You can further specify how references are treated with an optional `references` key in
`record_from_dict`. Let's assume that in the above example, we have an `author` **Property** with
datatype `Person` in our data model. We could add this information by extending the above example
definition by:

```yaml
PropertiesFromDictElement:
  type: PropertiesFromDictElement
  match: ".*"
  record_from_dict:
    variable_name: MyRec
    parents:
      - MyType1
      - MyType2
    references:
      author:
        parents:
          - Person
```

so that now, the `Person` Record with `full_name` "Silvia Scientist" is created as the value of the
`author` property:

:::{figure} /.assets/images/tutorials/crawler/properties-from-dict-records-person.png
:alt: A new Person Record is created which is referenced as an author.
:height: 200
:::

For the time being, only the parents of the referenced Record can be set via this option. More
complicated treatments can be implemented via the `referenced_record_callback` (see below).

All keys listed under `properties_blacklist` are excluded from automated treatment, so no
properties are created automatically for them.
Since the {py:class}`~caoscrawler.converters.converters.PropertiesFromDictConverter` has all the
functionality of the {py:class}`~caoscrawler.converters.converters.DictElementConverter`,
individual properties can still be used in a subtree. Together with `properties_blacklist`, this
can be used to add custom treatment to specific properties by blacklisting them in
`record_from_dict` and then treating them in the subtree in the same way as in the standard
{py:class}`~caoscrawler.converters.converters.DictElementConverter`. Note that the blacklisted keys
are excluded on **all** levels of the dictionary, which means this also applies when they occur in
a referenced entity.

For further customization, the
{py:class}`~caoscrawler.converters.converters.PropertiesFromDictConverter` can be used as a basis
for [custom converters](custom_converters.md), which can use its `referenced_record_callback`
argument to add custom handling of references. The `referenced_record_callback` can be a callable
object which takes a Record as an argument and needs to return that Record after applying its
treatment. Additionally, it is given the `RecordStore` and the `ValueStore` in order to be able to
access the records and values that have already been defined from within
`referenced_record_callback`. This might look like the following:

```python
import linkahead as db
from caoscrawler.stores import GeneralStore, RecordStore  # adjust if your version differs

def my_callback(rec: db.Record, records: RecordStore, values: GeneralStore):
    # Do something with rec, possibly using other records or values from the stores...
    rec.description = "This was updated in a callback"
    return rec
```

This function is applied to all Records that are created from the dictionary. This means it can be
used to, for example, transform values of some properties, or add special treatment to all Records
of a specific type. `referenced_record_callback` is applied **after** the properties from the
dictionary have been applied as explained above.
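
To make the combination of `references`, `properties_blacklist` and the callback more tangible,
here is a minimal, self-contained sketch that does not use LinkAhead at all. `SimpleRecord`,
`record_from_dict` (as a Python function), the key `internal_id` and the plain dictionaries
standing in for the stores are all hypothetical; the sketch only imitates the behaviour described
above and is not how the converter is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for a LinkAhead Record."""
    parents: list = field(default_factory=list)
    properties: dict = field(default_factory=dict)
    description: str = ""


def record_from_dict(data, parents, references, blacklist, callback, records, values):
    """Build a SimpleRecord from a dict; dict-valued entries become referenced records."""
    rec = SimpleRecord(parents=list(parents))
    for key, value in data.items():
        if key in blacklist:
            continue  # the blacklist is passed down, so it applies on all levels
        if isinstance(value, dict):
            # By default the referenced record gets the key as parent, unless
            # `references` names other parents for this key.
            ref_parents = references.get(key, {}).get("parents", [key])
            value = record_from_dict(value, ref_parents, references, blacklist,
                                     callback, records, values)
        rec.properties[key] = value
    # The callback is applied after the properties have been set.
    return callback(rec, records, values)


def my_callback(rec, records, values):
    rec.description = "This was updated in a callback"
    return rec


data = {
    "a": 5,
    "b": ["a", "b", "c"],
    "internal_id": 1,  # hypothetical key that should not become a property
    "author": {"full_name": "Silvia Scientist", "internal_id": 7},
}
rec = record_from_dict(
    data,
    parents=["MyType1", "MyType2"],
    references={"author": {"parents": ["Person"]}},
    blacklist={"internal_id"},
    callback=my_callback,
    records={},  # plain dicts as placeholders for the record and value stores
    values={},
)
print(rec.properties["author"].parents, sorted(rec.properties))
```

In this sketch the `author` entry ends up as a nested `SimpleRecord` with parent `Person`, the
blacklisted key is missing from both records, and both records carry the description set by the
callback.
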
## XML Converters

There are the following converters for XML content:

### XMLFileConverter

This is a converter that loads an XML file and creates an XMLElement containing the root element of
the XML tree. It can be matched in the subtree using the XMLTagConverter.

### XMLTagConverter

The XMLTagConverter is a generic converter for XMLElements with the following main features:

- It allows matching a combination of tag name, attribute names and text contents using the keys:
  - `match_tag`: regular expression, default empty string
  - `match_attrib`: dictionary of key and value pairs, both containing regular expressions. Each
    key matches an attribute name and the corresponding value matches its attribute value.
  - `match_text`: regular expression, default empty string
- It allows traversing the tree using XPath (using Python lxml's xpath functions):
  - The key `xpath` is used to set the xpath expression and has a default of `child::*`. Its
    default would generate just the list of sub nodes of the current node. The result of the xpath
    expression is used to generate structure elements as children. It then uses the keys
    `tags_as_children`, `attribs_as_children` and `text_as_children` to determine which information
    from the found nodes will be used as children:
  - `tags_as_children`: (default `true`) For each xml tag element found by the xpath expression,
    generate one XMLTag structure element. Its name is the full path to the tag using the function
    `getelementpath` from `lxml`.
  - `attribs_as_children`: (default `false`) For each xml tag element found by the xpath
    expression, generate one XMLAttributeNode structure element for each of its attributes. The
    attribute node's name has the form: `@`.
  - `text_as_children`: (default `false`) For each xml tag element found by the xpath expression,
    generate one XMLTextNode structure element containing the text content of the tag element. Note
    that in case of multiple text elements, only the first one is added.
    The name of the respective text node has the form: ` /text()` to the tag using the function
    `getelementpath` from `lxml`.

:::{note}
Currently, there is no converter implemented that can match XMLAttributeNodes.
:::

#### Namespaces

The default is to use the namespace map from the current node in xpath queries. Because default
namespaces cannot be handled by xpath, it is possible to remap the default namespace using the key
`default_namespace`. The key `nsmap` can be used to define additional nsmap entries.

% TODO: ### XMLTextNodeConverter

% TODO: In the future, this converter can be used to match XMLTextNodes that are generated by the
% TODO: XMLTagConverter.

## ZipFileConverter

This converter opens zip files, unzips them into a temporary directory and exposes the contents as
File structure elements.

### Usage Example:

```yaml
ExampleZipFile:
  type: ZipFile
  match: example\.zip$
  subtree:
    DirInsideZip:
      type: Directory
      match: experiments$
    FileInsideZip:
      type: File
      match: description.odt$
```

This converter will match and open files called `example.zip`. If the file contains a directory
called `experiments` or a file called `description.odt`, they will be processed further by the
respective converter in the subtree.
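
The behaviour of this example can be imitated with the Python standard library alone. The sketch
below unpacks a zip file into a temporary directory and checks its top-level entries against the
regular expressions from the definition above. The function `unzip_and_match` and the file
`notes.txt` are hypothetical, and the assumption that names are matched from their start is mine;
the actual converter works on structure elements and may differ in detail.

```python
import re
import tempfile
import zipfile
from pathlib import Path


def unzip_and_match(zip_path, zip_pattern, subtree):
    """Unpack a matching zip file and return {converter name: [matched entry names]}.

    `subtree` maps a converter name to a tuple (kind, pattern) with kind "Directory" or "File".
    """
    zip_path = Path(zip_path)
    if re.match(zip_pattern, zip_path.name) is None:
        return None  # the zip file itself does not match
    matches = {name: [] for name in subtree}
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        for entry in sorted(Path(tmp).iterdir()):
            kind = "Directory" if entry.is_dir() else "File"
            for name, (wanted_kind, pattern) in subtree.items():
                if kind == wanted_kind and re.match(pattern, entry.name):
                    matches[name].append(entry.name)
    return matches


# Build a small example archive and apply the definitions from the YAML above.
with tempfile.TemporaryDirectory() as work:
    example_zip = Path(work) / "example.zip"
    with zipfile.ZipFile(example_zip, "w") as archive:
        archive.writestr("experiments/run1.txt", "data")
        archive.writestr("description.odt", "text")
        archive.writestr("notes.txt", "not matched by any converter")
    result = unzip_and_match(
        example_zip,
        r"example\.zip$",
        {
            "DirInsideZip": ("Directory", r"experiments$"),
            "FileInsideZip": ("File", r"description.odt$"),
        },
    )
print(result)
```
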