caoscrawler.crawl module

Crawl a file structure using a YAML cfood definition and synchronize the acquired data with LinkAhead.

class caoscrawler.crawl.Crawler(generalStore: GeneralStore | None = None, debug: bool | None = None, identifiableAdapter: IdentifiableAdapter | None = None, securityMode: SecurityMode = SecurityMode.UPDATE)

Bases: object

Crawler class that encapsulates crawling functions. Furthermore it keeps track of the storage for records (record store) and the storage for values (general store).

static check_whether_parent_exists(records: list[Entity], parents: list[str])

Returns a list of all records in records that have at least one parent that is listed in parents.
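The filtering described above can be sketched without a LinkAhead installation. The class `FakeRecord` and the function `records_with_parent_in` below are hypothetical stand-ins introduced only for illustration; the real method operates on `Entity` objects, whose parents are richer than plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRecord:
    """Hypothetical stand-in for an Entity: a name plus a list of parent names."""
    name: str
    parents: list = field(default_factory=list)


def records_with_parent_in(records, parents):
    """Return the records that have at least one parent named in `parents`."""
    wanted = set(parents)
    return [rec for rec in records if wanted.intersection(rec.parents)]


records = [
    FakeRecord("exp1", ["Experiment"]),
    FakeRecord("sim1", ["Simulation"]),
    FakeRecord("mixed", ["Experiment", "Analysis"]),
]
selected = records_with_parent_in(records, ["Experiment"])
print([rec.name for rec in selected])
```

A record is kept as soon as one of its parents matches, so a record with several parents does not need all of them to be listed.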

static compact_entity_list_representation(entities, referencing_entities: list) → str

A more readable representation than the standard XML representation.

TODO this can be removed once the yaml format representation is in pylib

crawl_directory(crawled_directory: str, crawler_definition_path: str, restricted_path: list[str] | None = None)

The new main function to run the crawler on a directory.

property crawled_data
static create_entity_summary(entities: list[Entity])

Creates a summary string representation of a list of entities.

static debug_build_usage_tree(converter: Converter)
static execute_inserts_in_list(to_be_inserted, securityMode, run_id: UUID | None = None, unique_names=True)
static execute_parent_updates_in_list(to_be_updated, securityMode, run_id, unique_names)

Execute the updates of changed parents.

This method is used before the standard inserts. It is needed because some changes in parents (e.g. of Files) might fail if the parents are not updated first.

static execute_updates_in_list(to_be_updated, securityMode, run_id: UUID | None = None, unique_names=True)
generate_run_id()
static inform_about_pending_changes(pending_changes, run_id, path, inserts=False)
initialize_converters(crawler_definition: dict, converter_registry: dict)
load_converters(definition: dict)
load_definition(crawler_definition_path: str)
static remove_unnecessary_updates(crawled_data: list[Record], identified_records: list[Record])

Compare the Records to be updated with their remote counterparts. Only update if there are actual differences.

Return type:

update list without unnecessary updates

replace_entities_with_ids(rec: Record)
static replace_name_with_referenced_entity_id(prop: Property)

Changes the given property in place if it is a reference property that has a name as its value.

If the Property has a List datatype, each element is treated separately. If the datatype is generic, i.e. FILE or REFERENCE, values stay unchanged. If the value is not a string, the value stays unchanged. If the query using the datatype and the string value does not uniquely identify an Entity, the value stays unchanged. If an Entity is identified, then the string value is replaced by the ID.
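The rules above can be read as a small decision procedure. The following is a minimal sketch under stated assumptions: `resolve_single`, `resolve_property_value` and `fake_lookup` are hypothetical names, the server query is replaced by an in-memory lookup, list datatypes are assumed to be written as `LIST<Type>`, and the sketch returns a new value instead of modifying a `Property` in place as the real method does.

```python
from typing import Any, Callable

# Datatypes the description above calls generic; their values stay unchanged.
GENERIC_DATATYPES = {"FILE", "REFERENCE"}


def resolve_single(datatype: str, value: Any, lookup: Callable[[str, str], list]) -> Any:
    """Return the ID if `value` is a name that uniquely identifies an entity."""
    if datatype in GENERIC_DATATYPES or not isinstance(value, str):
        return value  # generic datatype or non-string value: unchanged
    matches = lookup(datatype, value)
    # Zero or several matches: not uniquely identified, so keep the name.
    return matches[0] if len(matches) == 1 else value


def resolve_property_value(datatype: str, value: Any, lookup) -> Any:
    """Treat each element of a list value separately, scalars directly."""
    if datatype.startswith("LIST<") and datatype.endswith(">") and isinstance(value, list):
        element_type = datatype[len("LIST<"):-1]
        return [resolve_single(element_type, item, lookup) for item in value]
    return resolve_single(datatype, value, lookup)


def fake_lookup(datatype: str, name: str) -> list:
    """Hypothetical replacement for the server query: returns matching IDs."""
    index = {("Person", "Alice"): [101], ("Person", "Bob"): [102, 103]}
    return index.get((datatype, name), [])


print(resolve_property_value("Person", "Alice", fake_lookup))
print(resolve_property_value("LIST<Person>", ["Alice", "Bob", 7], fake_lookup))
```

In this sketch "Bob" matches two entities and is therefore left as a name, which mirrors the rule that ambiguous values stay unchanged.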

save_debug_data(filename: str, debug_tree: DebugTree | None = None)

Save the information contained in a debug_tree to a file named filename.

static set_ids_and_datatype_of_parents_and_properties(rec_list)
start_crawling(items: list[StructureElement] | StructureElement, crawler_definition: dict, converter_registry: dict, restricted_path: list[str] | None = None)
synchronize(commit_changes: bool = True, unique_names: bool = True, crawled_data: list[Record] | None = None, no_insert_RTs: list[str] | None = None, no_update_RTs: list[str] | None = None, path_for_authorized_run: str | list[str] | None = '') → tuple[list, list]

This function applies several stages: 1) Retrieve identifiables for all records in crawled_data. 2) Compare crawled_data with existing records. 3) Insert and update records based on the set of identified differences.

This function makes use of an IdentifiableAdapter, which is used to register and retrieve identifiables.

Parameters:
  • commit_changes (bool, default=True) – If True, the changes are synchronized to the LinkAhead server. For debugging it can be useful to set this to False.

  • unique_names (bool) – Whether or not to update or insert entities in spite of name conflicts.

  • crawled_data (list[db.Record], optional) – The data that shall be synchronized. This parameter should always be given; calling this method without it is deprecated and will be forbidden in the future.

  • no_insert_RTs (list[str], optional) – List of RecordType names. Records that have one of those RecordTypes as parent will not be inserted.

  • no_update_RTs (list[str], optional) – List of RecordType names. Records that have one of those RecordTypes as parent will not be updated.

  • path_for_authorized_run (str or list[str], optional) – Only used if there are changes that need authorization before being applied. The form for rerunning the crawler with the authorization of these changes will be generated with this path. See caosadvancedtools.crawler.Crawler.save_form for more information about the authorization form.

Returns:

The final to_be_inserted and to_be_updated lists, as a tuple.

Return type:

inserts and updates
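The three stages of synchronize can be illustrated with plain dictionaries. This is only a sketch of the control flow under simplifying assumptions: `synchronize_sketch` is a hypothetical function, the server is replaced by a dictionary, records are dictionaries, and the identifiable of a record is assumed to be the pair of parent and name. The real method delegates identification to an IdentifiableAdapter and talks to a LinkAhead server.

```python
def synchronize_sketch(crawled, remote, commit_changes=True,
                       no_insert_RTs=None, no_update_RTs=None):
    """Return (to_be_inserted, to_be_updated) for dict-based stand-in records."""
    no_insert = set(no_insert_RTs or [])
    no_update = set(no_update_RTs or [])
    to_be_inserted, to_be_updated = [], []
    for rec in crawled:
        # Stage 1: build the identifiable (assumed here: parent plus name).
        key = (rec["parent"], rec["name"])
        # Stage 2: compare the crawled record with the existing one, if any.
        existing = remote.get(key)
        if existing is None:
            if rec["parent"] not in no_insert:
                to_be_inserted.append(rec)
        elif existing != rec and rec["parent"] not in no_update:
            to_be_updated.append(rec)
    # Stage 3: apply the changes unless this is a dry run.
    if commit_changes:
        for rec in to_be_inserted + to_be_updated:
            remote[(rec["parent"], rec["name"])] = dict(rec)
    return to_be_inserted, to_be_updated


remote = {("Experiment", "e1"): {"parent": "Experiment", "name": "e1", "temp": 20}}
crawled = [
    {"parent": "Experiment", "name": "e1", "temp": 25},  # differs from remote
    {"parent": "Experiment", "name": "e2", "temp": 30},  # not on the remote
    {"parent": "Person", "name": "p1"},                  # blocked by no_insert_RTs
]
inserts, updates = synchronize_sketch(
    crawled, remote, commit_changes=False, no_insert_RTs=["Person"])
print(len(inserts), len(updates))
```

With commit_changes=False the stand-in remote store is left untouched, which corresponds to the debugging use mentioned for that parameter.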

exception caoscrawler.crawl.ForbiddenTransaction

Bases: Exception

class caoscrawler.crawl.SecurityMode(*values)

Bases: Enum

INSERT = 1
RETRIEVE = 0
UPDATE = 2
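The numeric values of the members suggest an ordering from least to most permissive. The sketch below re-declares the enum locally so that it runs without caoscrawler, and adds a hypothetical helper `is_allowed`. The interpretation it encodes (RETRIEVE permits no changes, INSERT permits inserts only, UPDATE permits inserts and updates) is an assumption that this page does not state explicitly.

```python
from enum import Enum


class SecurityMode(Enum):
    """Local copy of the values listed above, for illustration only."""
    RETRIEVE = 0
    INSERT = 1
    UPDATE = 2


def is_allowed(mode: SecurityMode, operation: str) -> bool:
    """ASSUMPTION: an operation is allowed if the mode is at least as
    permissive as the mode named after that operation."""
    required = {"insert": SecurityMode.INSERT, "update": SecurityMode.UPDATE}[operation]
    return mode.value >= required.value


for mode in SecurityMode:
    print(mode.name, is_allowed(mode, "insert"), is_allowed(mode, "update"))
```

Because the class derives from Enum rather than IntEnum, members are compared through their `.value` attribute here.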
caoscrawler.crawl.check_identical(record1: Entity, record2: Entity, ignore_id=False)

Check whether two entities are identical.

This function uses compare_entities to check whether two entities are identical, applying the following rules:

  • If one of the entities has additional parents or additional properties -> not identical

  • If the value of one of the properties differs -> not identical

  • If datatype, importance or unit are reported as different for a property by compare_entities -> not identical only if these attributes are set explicitly by record1. The difference is ignored otherwise.

  • If description, name, id or path appear in list of differences -> not identical.

  • If file, checksum or size appear -> only different if explicitly set by record1.

record1 serves as the reference, so datatype, importance and unit checks are carried out using the attributes from record1. In that respect, the function is not symmetrical in its arguments.
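The asymmetry described above can be made concrete with a small sketch. It assumes a simplified model in which a record is a dictionary mapping property names to dictionaries of attributes, and in which an attribute counts as "set explicitly" when it is present and not None. The function `identical_sketch` is hypothetical and covers only the property rules, not parents, names, IDs or file attributes.

```python
def identical_sketch(record1: dict, record2: dict) -> bool:
    """Compare two dict-based stand-in records, with record1 as the reference."""
    if record1.keys() != record2.keys():
        return False  # one side has additional properties
    for name, prop1 in record1.items():
        prop2 = record2[name]
        if prop1.get("value") != prop2.get("value"):
            return False  # differing values always count
        for attr in ("datatype", "importance", "unit"):
            # Only attributes that record1 sets explicitly are compared.
            if prop1.get(attr) is not None and prop1.get(attr) != prop2.get(attr):
                return False
    return True


crawled = {"length": {"value": 5.0}}               # unit not set explicitly
remote = {"length": {"value": 5.0, "unit": "m"}}   # unit known on the server

print(identical_sketch(crawled, remote))
print(identical_sketch(remote, crawled))
```

Swapping the arguments changes the result in this example: with `crawled` as the reference the unit difference is ignored, while with `remote` as the reference it is not.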

caoscrawler.crawl.crawler_main(crawled_directory_path: str | list[str], cfood_file_name: str, identifiables_definition_file: str | None = None, debug: bool = False, provenance_file: str | None = None, dry_run: bool = False, prefix: str = '', securityMode: SecurityMode = SecurityMode.UPDATE, unique_names: bool = True, restricted_path: list[str] | None = None, remove_prefix: str | None = None, add_prefix: str | None = None, sss_max_log_level: int | None = None)
Parameters:
  • crawled_directory_path (str or list[str]) – path(s) to be crawled

  • cfood_file_name (str) – filename of the cfood to be used

  • identifiables_definition_file (str) – filename of an identifiable definition yaml file

  • debug (bool) – DEPRECATED, use a provenance file instead.

  • provenance_file (str) – Provenance information will be stored in a file with given filename

  • dry_run (bool) – do not commit any changes to the server

  • prefix (str) – DEPRECATED, remove the given prefix from file paths

  • securityMode (SecurityMode) – securityMode of Crawler

  • unique_names (bool) – Whether or not to update or insert entities in spite of name conflicts.

  • restricted_path (optional, list of strings) – Traverse the data tree only along the given path. When the end of the given path is reached, traverse the full tree as normal. See docstring of ‘scanner’ in module ‘scanner’ for more details.

  • remove_prefix (Optional[str]) – Remove the given prefix from file paths. See docstring of ‘_fix_file_paths’ for more details.

  • add_prefix (Optional[str]) – Add the given prefix to file paths. See docstring of ‘_fix_file_paths’ for more details.

  • sss_max_log_level (Optional[int]) – If given, set the maximum log level of the server-side scripting log separately from the general debug option. If None is given, the maximum sss log level will be determined from the value of debug: logging.INFO if debug is False, logging.DEBUG if debug is True.

Returns:

return_value – 0 if successful

Return type:

int
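The fallback rule described for sss_max_log_level can be written out as a few lines of Python. The helper name `resolve_sss_max_log_level` is hypothetical and not part of caoscrawler; the sketch only restates the rule given in the parameter description.

```python
import logging
from typing import Optional


def resolve_sss_max_log_level(debug: bool, sss_max_log_level: Optional[int] = None) -> int:
    """Pick the server-side scripting log level as described above."""
    if sss_max_log_level is not None:
        return sss_max_log_level  # an explicit value wins over `debug`
    return logging.DEBUG if debug else logging.INFO


print(resolve_sss_max_log_level(debug=False))
print(resolve_sss_max_log_level(debug=True))
print(resolve_sss_max_log_level(debug=True, sss_max_log_level=logging.WARNING))
```

The check uses `is not None` rather than truthiness so that an explicit level of 0 (logging.NOTSET) would still be respected in this sketch.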

caoscrawler.crawl.main()
caoscrawler.crawl.parse_args()
caoscrawler.crawl.split_restricted_path(path)

Split a path string into components separated by slashes or by the platform's os.path.sep. Empty elements are removed.
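The behaviour described above can be approximated as follows. This is a sketch of the described splitting, not the library's implementation; the function name `split_path_sketch` is hypothetical.

```python
import os
import re


def split_path_sketch(path: str) -> list:
    """Split on "/" and on os.path.sep, dropping empty components."""
    separators = {"/", os.path.sep}
    pattern = "|".join(re.escape(sep) for sep in sorted(separators))
    return [part for part in re.split(pattern, path) if part]


print(split_path_sketch("/data//experiments/2024/"))
```

Leading, trailing and doubled separators produce empty strings when splitting, and the list comprehension filters those out, so only the non-empty path components remain.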