caoscrawler.converters.converters module#
Converters take structure elements and create Records and new structure elements from them.
- class caoscrawler.converters.converters.BooleanElementConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
_AbstractScalarValueElementConverter
- default_matches = {'accept_bool': True, 'accept_float': False, 'accept_int': True, 'accept_text': False}#
- class caoscrawler.converters.converters.CSVTableConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
TableConverter
- create_children(generalStore: GeneralStore, element: StructureElement)#
- get_options()#
Get specific options, e.g. from self.definitions.
This method may be overwritten by the specific table converter to provide information about the possible options. Implementors may use TableConverter._get_options(...) to get (and convert) options from self.definitions.
- Returns:
out – An options dict.
- Return type:
dict
- class caoscrawler.converters.converters.Converter(definition: dict, name: str, converter_registry: dict)#
Bases:
object
Converters treat StructureElements contained in the hierarchical structure.
This is the abstract super class for all Converters.
- apply_transformers(values: GeneralStore, transformer_functions: dict)#
Check if transformers are defined using the “transform” keyword. Then apply the transformers to the variables defined in GeneralStore “values”.
- Parameters:
values (GeneralStore) – The GeneralStore to store values in.
transformer_functions (dict) –
A dictionary of registered functions that can be used within this transformer block. The keys of the dict are the function keys and the values the callable functions of the form:
def func(in_value: Any, in_parameters: dict) -> Any:
    pass
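As an illustration of the function form above, here is a minimal sketch of one transformer function and a registry dict that maps a function key to it. The name split_value, the "marker" parameter and the registry contents are hypothetical and not taken from the library.

```python
from typing import Any


def split_value(in_value: Any, in_parameters: dict) -> Any:
    """Hypothetical transformer: split a string at a marker and strip whitespace."""
    marker = in_parameters.get("marker", ",")  # "marker" is an assumed parameter name
    return [part.strip() for part in str(in_value).split(marker)]


# Hypothetical registry: function key -> callable, as described for transformer_functions.
transformer_functions = {"split_value": split_value}

result = transformer_functions["split_value"]("a, b, c", {"marker": ","})
```

Any callable that accepts the input value and a parameter dict, and returns the transformed value, fits the same shape.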
- cleanup()#
This function is called when the converter runs out of scope and can be used to clean up objects that were needed in the converter or its children.
- static converter_factory(definition: dict, name: str, converter_registry: dict)#
Create a Converter instance of the appropriate class.
The type key in the definition defines the Converter class which is being used.
- abstractmethod create_children(values: GeneralStore, element: StructureElement)#
- create_records(values: GeneralStore, records: RecordStore, element: StructureElement)#
- create_values(values: GeneralStore, element: StructureElement)#
Extract information from the structure element and store it as values in the general store.
- Parameters:
values (GeneralStore) – The GeneralStore to store values in.
element (StructureElement) – The StructureElement to extract values from.
- static debug_matching(kind=None)#
- filter_children(children_with_strings: list[tuple[StructureElement, str]], expr: str, group: str, rule: str)#
Filter children according to regexp expr and rule.
- abstractmethod match(element: StructureElement) dict | None#
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- match_properties(properties: dict, vardict: dict, label: str = 'match_properties')#
This method can be used to generically match ‘match_properties’ from the cfood definition with the behavior described as follows:
‘match_properties’ is a dictionary of key-regexps and value-regexp pairs. Each key matches a property name and the corresponding value matches its property value.
What a property means in the context of the respective converter can be different, examples:
XMLTag: attributes of the node
ROCrate: properties of the ROCrateEntity
DictElement: properties of the dict
label can be used to customize the name of the dictionary in the definition.
This method is not called by default, but can be called from child classes.
Typically it would be used like this from methods overwriting match:
if not self.match_properties(<properties>, vardict): return None
vardict will be updated in place when there are matches. <properties> is a dictionary taken from the structure element that contains the properties in the context of this converter.
- Parameters:
properties (dict) – The dictionary containing the properties to be matched.
vardict (dict) – This dictionary will be used to store the variables created during the matching.
label (str) – Default “match_properties”. Can be used to change the name of the property in the definition. E.g. the xml converter uses “match_attrib” which makes more sense in the context of xml trees.
- Returns:
Returns True when properties match and False otherwise. The vardict dictionary is updated in place.
- Return type:
bool
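The matching behavior described above can be sketched with plain regular expressions. This is an illustrative stand-in, not the library's implementation: the function name is hypothetical, and the use of full matches and of named groups to fill vardict are assumptions.

```python
import re


def match_properties_sketch(definition: dict, properties: dict, vardict: dict) -> bool:
    """Illustrative only: every (key regexp, value regexp) pair must match some property."""
    collected = {}
    for key_re, value_re in definition.items():
        for prop_name, prop_value in properties.items():
            key_match = re.fullmatch(key_re, prop_name)
            value_match = re.fullmatch(value_re, str(prop_value))
            if key_match and value_match:
                # Named groups from both regexps become variables.
                collected.update(key_match.groupdict())
                collected.update(value_match.groupdict())
                break
        else:
            return False  # no property satisfied this pair
    vardict.update(collected)  # updated in place, only on success
    return True


vardict = {}
ok = match_properties_sketch(
    {"type": r"(?P<kind>\w+)", r"id_(?P<n>\d+)": ".*"},
    {"type": "sample", "id_7": "x"},
    vardict,
)
```

A caller overriding match would return None as soon as such a check reports False.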
- metadata: dict[str, set[str]]#
- setup()#
Analogous to cleanup. Can be used to set up variables that are permanently stored in this converter.
- abstractmethod typecheck(element: StructureElement)#
Check whether the current structure element can be converted using this converter.
- exception caoscrawler.converters.converters.ConverterValidationError(msg)#
Bases:
Exception
To be raised if contents of an element to be converted are invalid.
- class caoscrawler.converters.converters.CrawlerTemplate(template)#
Bases:
Template
- braceidpattern = '\\D[.\\w]*'#
- pattern = re.compile('\n \\$(?:\n (?P<escaped>\\$) | # Escape sequence of two delimiters\n (?P<named>(?a:[_a-z][_a-z0-9]*)) | # delimiter and a Python identifier\n , re.IGNORECASE|re.VERBOSE)#
- class caoscrawler.converters.converters.DateElementConverter(definition, *args, **kwargs)#
Bases:
TextElementConverter
Converts different text formats of dates to Python date objects.
The text to be parsed must be contained in the “date” group. The format string can be supplied under “date_format” in the Converter definition. Parsing relies on the datetime library; see its documentation for how to write the format string.
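A minimal sketch of the idea using only the standard library: a regular expression with a “date” group captures the text, and a datetime format string turns it into a date object. The sample text, the regular expression and the format value are made up for illustration; the converter's internals may differ.

```python
import datetime
import re

text = "measured on 2023-04-17"
# The named group "date" holds the text that is to be parsed.
match = re.match(r"measured on (?P<date>.*)", text)

date_format = "%Y-%m-%d"  # hypothetical value for "date_format"
parsed = datetime.datetime.strptime(match.group("date"), date_format).date()
```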
- match(element: StructureElement)#
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- class caoscrawler.converters.converters.DatetimeElementConverter(definition, *args, **kwargs)#
Bases:
TextElementConverter
Convert text so that it is formatted in a way that LinkAhead can understand it.
The text to be parsed must be in the val parameter. The format string can be supplied in the datetime_format node. This class uses the datetime module, so datetime_format must follow this specification: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
- match(element: StructureElement)#
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- class caoscrawler.converters.converters.DictBooleanElementConverter(*args, **kwargs)#
Bases:
BooleanElementConverter
- class caoscrawler.converters.converters.DictConverter(*args, **kwargs)#
Bases:
DictElementConverter
- class caoscrawler.converters.converters.DictDictElementConverter(*args, **kwargs)#
Bases:
DictElementConverter
- class caoscrawler.converters.converters.DictElementConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
Converter
Operates on: caoscrawler.structure_elements.DictElement
Generates: caoscrawler.structure_elements.StructureElement
- create_children(generalStore: GeneralStore, element: StructureElement)#
- match(element: StructureElement)#
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- typecheck(element: StructureElement)#
Check whether the current structure element can be converted using this converter.
- class caoscrawler.converters.converters.DictFloatElementConverter(*args, **kwargs)#
Bases:
FloatElementConverter
- class caoscrawler.converters.converters.DictIntegerElementConverter(*args, **kwargs)#
Bases:
IntegerElementConverter
- class caoscrawler.converters.converters.DictListElementConverter(*args, **kwargs)#
Bases:
ListElementConverter
- class caoscrawler.converters.converters.DictTextElementConverter(*args, **kwargs)#
Bases:
TextElementConverter
- class caoscrawler.converters.converters.DirectoryConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
Converter
Converter that matches and handles structure elements of type directory.
This is one typical starting point of a crawling procedure.
- create_children(generalStore: GeneralStore, element: StructureElement)#
- static create_children_from_directory(element: Directory)#
Creates a list of files (of type File) and directories (of type Directory) for a given directory. No recursion.
element: A directory (of type Directory) which will be traversed.
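The non-recursive listing can be pictured with os.scandir. This is a sketch under assumptions: list_children and the (kind, name) tuples are hypothetical stand-ins for the File and Directory structure elements, and sorting by name is an arbitrary choice made here.

```python
import os
import tempfile


def list_children(path: str) -> list[tuple[str, str]]:
    """Return (kind, name) pairs for the direct children of path, sorted by name."""
    children = []
    with os.scandir(path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            kind = "Directory" if entry.is_dir() else "File"
            children.append((kind, entry.name))
    return children


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    # A nested file, which must not show up in the listing of tmp.
    open(os.path.join(tmp, "sub", "nested.txt"), "w").close()
    open(os.path.join(tmp, "data.csv"), "w").close()
    children = list_children(tmp)
```

Only direct children appear; the nested file is reached later, when the subdirectory itself is processed.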
- create_values(values: GeneralStore, element: StructureElement)#
Extract information from the structure element and store it as values in the general store.
- Parameters:
values (GeneralStore) – The GeneralStore to store values in.
element (StructureElement) – The StructureElement to extract values from.
- match(element: StructureElement)#
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- typecheck(element: StructureElement)#
Check whether the current structure element can be converted using this converter.
- class caoscrawler.converters.converters.FileConverter(*args, **kwargs)#
Bases:
SimpleFileConverter
- class caoscrawler.converters.converters.FloatElementConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
_AbstractScalarValueElementConverter
- default_matches = {'accept_bool': False, 'accept_float': True, 'accept_int': True, 'accept_text': False}#
- class caoscrawler.converters.converters.IntegerElementConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
_AbstractScalarValueElementConverter
- default_matches = {'accept_bool': False, 'accept_float': False, 'accept_int': True, 'accept_text': False}#
- class caoscrawler.converters.converters.JSONFileConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
SimpleFileConverter
- create_children(generalStore: GeneralStore, element: StructureElement)#
- class caoscrawler.converters.converters.ListElementConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
Converter
- create_children(generalStore: GeneralStore, element: StructureElement)#
- match(element: StructureElement)#
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- typecheck(element: StructureElement)#
Check whether the current structure element can be converted using this converter.
- class caoscrawler.converters.converters.MarkdownFileConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
SimpleFileConverter
Read the YAML header of markdown files (if such a header exists).
- create_children(generalStore: GeneralStore, element: StructureElement)#
- class caoscrawler.converters.converters.PropertiesFromDictConverter(definition: dict, name: str, converter_registry: dict, referenced_record_callback: Callable | None = None)#
Bases:
DictElementConverter
Extend the DictElementConverter by a heuristic to set property values from the dictionary keys.
- create_records(values: GeneralStore, records: RecordStore, element: StructureElement)#
- class caoscrawler.converters.converters.SimpleFileConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
Converter
Just a file, ignore the contents.
- create_children(generalStore: GeneralStore, element: StructureElement)#
- create_values(values: GeneralStore, element: StructureElement)#
Extract information from the structure element and store it as values in the general store.
- Parameters:
values (GeneralStore) – The GeneralStore to store values in.
element (StructureElement) – The StructureElement to extract values from.
- match(element: StructureElement)#
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- typecheck(element: StructureElement)#
Check whether the current structure element can be converted using this converter.
- class caoscrawler.converters.converters.TableConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
Converter
This converter reads tables in different formats line by line and allows matching the corresponding rows.
The subtree generated by the table converter consists of DictElements, each being a row. The corresponding header elements will become the dictionary keys.
The rows can be matched using a DictElementConverter.
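The row-to-dictionary mapping can be illustrated with the standard csv module. This only shows the shape of the data (header cells become keys, each row becomes one dict) and is not the converter's own reading code; the sample table is made up.

```python
import csv
import io

# A small in-memory table; the first line is the header.
table = io.StringIO("sample,temperature\nA1,293\nB2,310\n")

# Each row becomes a dict whose keys are the header cells.
rows = list(csv.DictReader(table))
```

Each such dict corresponds to one DictElement in the generated subtree.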
- get_options() dict#
Get specific options, e.g. from self.definitions.
This method may be overwritten by the specific table converter to provide information about the possible options. Implementors may use TableConverter._get_options(...) to get (and convert) options from self.definitions.
- Returns:
out – An options dict.
- Return type:
dict
- match(element: StructureElement)#
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- typecheck(element: StructureElement)#
Check whether the current structure element can be converted using this converter.
- class caoscrawler.converters.converters.TextElementConverter(definition, *args, **kwargs)#
Bases:
_AbstractScalarValueElementConverter
- default_matches = {'accept_bool': True, 'accept_float': True, 'accept_int': True, 'accept_text': True}#
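One plausible reading of the default_matches flags is that each flag states whether a scalar of that Python type is accepted. The sketch below encodes that reading; is_accepted is a hypothetical helper and the interpretation is an assumption, not confirmed library behavior. The flag values are copied from the IntegerElementConverter and TextElementConverter defaults on this page.

```python
def is_accepted(value, flags: dict) -> bool:
    """Assumed interpretation: one flag per scalar type decides acceptance."""
    # bool is tested first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return flags["accept_bool"]
    if isinstance(value, int):
        return flags["accept_int"]
    if isinstance(value, float):
        return flags["accept_float"]
    if isinstance(value, str):
        return flags["accept_text"]
    return False


integer_defaults = {'accept_bool': False, 'accept_float': False,
                    'accept_int': True, 'accept_text': False}
text_defaults = {'accept_bool': True, 'accept_float': True,
                 'accept_int': True, 'accept_text': True}
```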
- class caoscrawler.converters.converters.XLSXTableConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
TableConverter
Operates on: caoscrawler.structure_elements.File
Generates: caoscrawler.structure_elements.DictElement
- create_children(generalStore: GeneralStore, element: StructureElement)#
- get_options()#
Get specific options, e.g. from self.definitions.
This method may be overwritten by the specific table converter to provide information about the possible options. Implementors may use TableConverter._get_options(...) to get (and convert) options from self.definitions.
- Returns:
out – An options dict.
- Return type:
dict
- class caoscrawler.converters.converters.YAMLFileConverter(definition: dict, name: str, converter_registry: dict)#
Bases:
SimpleFileConverter
- create_children(generalStore: GeneralStore, element: StructureElement)#
- caoscrawler.converters.converters.convert_basic_element(element: list | dict | bool | int | float | str | None, name=None, msg_prefix='')#
Convert basic Python objects to the corresponding StructureElements
- caoscrawler.converters.converters.create_path_value(func)#
Decorator for create_values functions that adds a value containing the path.
Should be used for StructureElements that are associated with file system objects that have a path, like File or Directory.
- caoscrawler.converters.converters.create_records(values: GeneralStore, records: RecordStore, def_records: dict) list[tuple[str, str]]#
Create records in GeneralStore values and RecordStore records as given by the definition in def_records.
This function will be called during scanning using the cfood definition. It should also be used by CustomConverters to set records, so that automatic substitution and other crawler features are applied.
- Parameters:
values (GeneralStore) –
This GeneralStore will be used to access variables that are needed during variable substitution in setting the properties of records and files.
Furthermore, the records that are generated in this function will be stored in this GeneralStore additionally to storing them in the RecordStore given as the second argument to this function.
records (RecordStore) – The RecordStore where the generated records will be stored.
- Returns:
A list of tuples containing the record names (1st element of tuple) and respective property names as 2nd element of the tuples. This list will be used by the scanner for creating the debug tree.
- Return type:
list[tuple[str, str]]
- caoscrawler.converters.converters.handle_value(value: dict | str | list, values: GeneralStore)#
Determine whether the given value needs to set a property, be added to an existing value (create a list) or be added as an additional property (multiproperty).
Variable names (starting with a “$”) are replaced by the corresponding value stored in the values GeneralStore.
- Parameters:
value (Union[dict, str, list]) –
If str, the value to be interpreted. E.g. “4”, “hello” or “$a” etc. No unit is set and collection mode is determined from the first character:
- ‘+’ corresponds to “list”
- ‘*’ corresponds to “multiproperty”
- everything else is “single”
If dict, it must have a value key and may have unit and collection_mode keys. The returned tuple is directly created from the corresponding values if they are given; unit defaults to None and collection_mode is determined from value as explained for the str case above, i.e.,
- if it starts with ‘+’, collection mode is “list”,
- in case of ‘*’, collection mode is “multiproperty”,
- and everything else is “single”.
If list, each element is checked for variable replacement and the resulting list will be used as (list) value for the property.
- Returns:
out –
the final value of the property; variable names contained in values are replaced.
the final unit of the property; variable names contained in values are replaced.
the collection mode (can be single, list or multiproperty)
- Return type:
tuple
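The first-character rule for the collection mode can be sketched as follows. split_collection_mode is a hypothetical helper; that the prefix character is stripped from the value is an assumption made for this illustration, and variable replacement and units are left out.

```python
def split_collection_mode(value: str) -> tuple[str, str]:
    """Illustrative only: derive the collection mode from the first character."""
    if value.startswith("+"):
        return value[1:], "list"  # assumption: the prefix is removed
    if value.startswith("*"):
        return value[1:], "multiproperty"
    return value, "single"


modes = [split_collection_mode(v)[1] for v in ("+$a", "*$a", "$a", "hello")]
```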
- caoscrawler.converters.converters.match_name_and_value(definition, name, value)#
Take match definitions from the definition argument and apply regular expressions to name and possibly value.
Exactly one of the keys match_name and match must exist in definition; match_value is optional.
- Returns:
out – None, if match_name or match lead to no match. Otherwise, returns a dictionary with the matched groups, possibly including matches from using definition[“match_value”].
- Return type:
dict or None
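A minimal sketch of the described behavior with the standard re module. The function name is hypothetical; raising ValueError for an invalid definition and returning None when match_value does not match are assumptions, since the text above only specifies the no-match case for match_name and match.

```python
import re


def match_name_and_value_sketch(definition: dict, name: str, value):
    """Illustrative only: collect named groups from the name and, optionally, the value."""
    # Exactly one of "match_name" and "match" must be present.
    if ("match_name" in definition) == ("match" in definition):
        raise ValueError("Exactly one of 'match_name' and 'match' is required.")
    name_pattern = definition.get("match_name", definition.get("match"))

    name_match = re.match(name_pattern, name)
    if name_match is None:
        return None
    groups = name_match.groupdict()

    if "match_value" in definition:
        value_match = re.match(definition["match_value"], str(value))
        if value_match is None:
            return None  # assumption: a failed value match also yields None
        groups.update(value_match.groupdict())
    return groups


definition = {"match_name": r"temp_(?P<unit>\w+)", "match_value": r"(?P<number>\d+)"}
matched = match_name_and_value_sketch(definition, "temp_K", 293)
unmatched = match_name_and_value_sketch(definition, "pressure", 1)
```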
- caoscrawler.converters.converters.replace_variables(propvalue: Any, values: GeneralStore)#
This function replaces variables in property values (and possibly other locations, where the crawler can replace cfood-internal variables).
If propvalue is a single variable name preceded by a $ (e.g. $var or ${var}), then the corresponding value stored in values is returned. In any other case the variable substitution is carried out as defined by string templates and a new string with the replaced variables is returned.
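The two cases can be sketched with string.Template from the standard library. A plain dict stands in for the GeneralStore, the function name is hypothetical, and the regular expression for a single variable is an assumption; the library uses its own CrawlerTemplate subclass, whose pattern may accept other names.

```python
import re
from string import Template

# Assumed shape of "a single variable": $name or ${name}, and nothing else.
SINGLE_VAR = re.compile(r"\$(?:\{([_a-zA-Z]\w*)\}|([_a-zA-Z]\w*))")


def replace_variables_sketch(propvalue, values: dict):
    """Illustrative only: values is a plain dict standing in for the GeneralStore."""
    if not isinstance(propvalue, str):
        return propvalue
    single = SINGLE_VAR.fullmatch(propvalue)
    if single:
        name = single.group(1) or single.group(2)
        if name in values:
            return values[name]  # stored object is returned as is, type preserved
    # Any other case: ordinary template substitution, which yields a new string.
    return Template(propvalue).safe_substitute(values)


store = {"n": 5}
single_result = replace_variables_sketch("$n", store)
braced_result = replace_variables_sketch("${n}", store)
template_result = replace_variables_sketch("run_$n", store)
```

The distinction matters because a single variable can hand back a non-string object such as a number or a record, while template substitution always produces text.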
- caoscrawler.converters.converters.str_to_bool(x)#
- caoscrawler.converters.converters.validate_against_json_schema(instance, schema_resource: dict | str)#
Validate given instance against given schema_resource.
- Parameters:
instance – Instance to be validated, typically dict but can be list, str, etc.
schema_resource – Either a path to the JSON file containing the schema or a dict with the schema.