caosadvancedtools.crawler module
Crawls a file structure and inserts Records into LinkAhead based on what is found.
LinkAhead can be filled with Records automatically, based on a file structure. The Crawler iterates over the files and tests, for each file, whether a CFood exists that matches the file path. If one does, it is instantiated to treat the match. This happens in three steps:
1. Create a list of identifiables, i.e. unique representations of LinkAhead Records (such as an experiment belonging to a project and a date/time).
2. The identifiables are either found in LinkAhead or created.
3. The identifiables are updated based on the data in the file structure.
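The three steps can be illustrated with a small, self-contained sketch. It does not use the caosadvancedtools or LinkAhead API: `ToyRecord`, `ToyServer` and `find_or_insert` are hypothetical stand-ins that only mimic the find-or-create-then-update flow in memory.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRecord:
    """Hypothetical record: identifying properties plus further data."""
    identifying: dict
    data: dict = field(default_factory=dict)
    id: Optional[int] = None


class ToyServer:
    """Hypothetical in-memory store standing in for a LinkAhead server."""

    def __init__(self):
        self._records = {}
        self._next_id = 1

    @staticmethod
    def _key(identifying):
        # Identifying properties act as the lookup key.
        return tuple(sorted(identifying.items()))

    def find(self, identifying):
        return self._records.get(self._key(identifying))

    def insert(self, record):
        record.id = self._next_id
        self._next_id += 1
        self._records[self._key(record.identifying)] = record
        return record


def find_or_insert(server, identifiable):
    """Step 2: reuse an existing record or insert a new one."""
    existing = server.find(identifiable.identifying)
    return existing if existing is not None else server.insert(identifiable)


server = ToyServer()
# Step 1: build the identifiable from what the file path provides.
ident = ToyRecord(identifying={"project": "P1", "date": "2020-01-01"})
# Step 2: look it up; insert it if it does not exist yet.
record = find_or_insert(server, ident)
# Step 3: update the record with data found in the file structure.
record.data["responsible"] = "A. Person"
```

Running step 2 again with an equal identifiable returns the record that already exists rather than creating a duplicate, which is the point of identifying Records by a fixed set of properties.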
- class caosadvancedtools.crawler.Crawler(cfood_types, use_cache=False, abort_on_exception=True, interactive=True, hideKnown=False, debug_file=None, cache_file=None)
Bases: object
- check_matches(matches)
- collect_cfoods()
This is the first phase of the crawl. It collects all cfoods that shall be processed. The second phase iterates over the cfoods and updates LinkAhead. This separate first step is necessary so that a single cfood can be influenced by multiple crawled items. E.g. the FileCrawler can have a single cfood treat multiple files.
This is a very basic implementation; subclasses should override this function.
The basic structure of this function should be as follows: whatever is being processed is iterated over, and each cfood is checked for whether the item ‘matches’. If it does, a cfood is instantiated with the item passed as an argument. The match can depend on the cfoods that have already been created, i.e. a file might no longer match because it is already treated by an earlier cfood.
Should return cfoods, tbs and errors_occured.
# TODO do this via logging?
- tbs: text returned from traceback
- errors_occured: True if at least one error occurred
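The structure described above might look roughly like the following sketch. The classes `ToyCFood` and `BrokenCFood`, the `match_item` classmethod and the free function `collect_cfoods` are illustrative assumptions rather than the library's actual implementation; only the three return values follow the description.

```python
import re
import traceback


class ToyCFood:
    """Hypothetical cfood that matches items against a regular expression."""
    pattern = re.compile(r".*\.csv$")

    def __init__(self, item):
        self.item = item

    @classmethod
    def match_item(cls, item):
        return cls.pattern.match(item) is not None


class BrokenCFood(ToyCFood):
    """Hypothetical cfood whose constructor fails, to show error collection."""
    pattern = re.compile(r".*\.bad$")

    def __init__(self, item):
        raise ValueError(f"cannot treat {item}")


def collect_cfoods(items, cfood_types):
    """Iterate over items, instantiate every matching cfood, collect errors."""
    cfoods, tbs, errors_occured = [], [], False
    for item in items:
        for cfood_type in cfood_types:
            try:
                if cfood_type.match_item(item):
                    cfoods.append(cfood_type(item))
            except Exception:
                # Keep the traceback text instead of aborting the crawl.
                tbs.append(traceback.format_exc())
                errors_occured = True
    return cfoods, tbs, errors_occured


cfoods, tbs, errors_occured = collect_cfoods(
    ["a.csv", "b.txt", "c.bad"], [ToyCFood, BrokenCFood]
)
```

In this sketch a failing cfood does not stop the collection; whether the real Crawler continues or aborts presumably depends on its abort_on_exception argument.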
- crawl(security_level=0, path=None)
- static create_query_for_identifiable(ident)
Uses the properties of ident to create a query that can determine whether the required record already exists.
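As a rough illustration of building such a query from properties, consider the sketch below. The helper `build_query` is hypothetical, and the query string is only modeled on the general `FIND RECORD ... WITH ...` shape of LinkAhead queries; the exact string produced by create_query_for_identifiable may differ.

```python
def build_query(record_type, properties):
    """Hypothetical helper: combine all properties into one AND-ed condition."""
    conditions = " AND ".join(
        f"'{name}'='{value}'" for name, value in properties.items()
    )
    query = f"FIND RECORD {record_type}"
    if conditions:
        query += f" WITH {conditions}"
    return query


query = build_query("Experiment", {"date": "2020-01-01", "project": "P1"})
# query == "FIND RECORD Experiment WITH 'date'='2020-01-01' AND 'project'='P1'"
```

A record counts as already existing if such a query returns a result; every identifying property narrows the search.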
- static find_existing(entity)
Searches LinkAhead for an entity that matches the identifiable.
Characteristics of the identifiable, such as properties, name or id, are used for the match.
- static find_or_insert_identifiables(identifiables)
Sets the ids of identifiables (those that do not already have an id from the cache) by searching LinkAhead, and retrieves those entities. The remaining entities (those which cannot be retrieved) have no counterpart in LinkAhead and are thus inserted.
- iteritems()
Generates items to be crawled with an index.
- static save_form(changes, path, run_id)
Saves an HTML page to a file. The page contains a form with a button to authorize the given changes.
The button will call the crawler with the same path that was used for the current run and with a parameter to authorize the changes of the current run.
Parameters:
- changes: The LinkAhead entities in the version after the update.
- path: The path defining the subtree that is crawled.
- class caosadvancedtools.crawler.FileCrawler(files, **kwargs)
Bases: Crawler
- iteritems()
Generates items to be crawled with an index.
- static query_files(path)
- class caosadvancedtools.crawler.TableCrawler(table, unique_cols, recordtype, **kwargs)
Bases: Crawler
- iteritems()
Generates items to be crawled with an index.
- caosadvancedtools.crawler.apply_list_of_updates(to_be_updated, update_flags=None, update_cache=None, run_id=None)
Updates the to_be_updated Container, i.e., pushes the changes to LinkAhead after removing possible duplicates. If a cache is provided, unauthorized updates can be cached for later authorization.
Parameters:
- to_be_updated : db.Container
Container with the entities that will be updated.
- update_flags : dict, optional
Dictionary of LinkAhead server flags that will be used for the update. Default is an empty dict.
- update_cache : UpdateCache or None, optional
Cache in which the intended updates will be stored so they can be authorized afterwards. Default is None.
- run_id : String or None, optional
Id with which the pending updates are cached. Only meaningful if update_cache is provided. Default is None.
- caosadvancedtools.crawler.get_value(prop)
Returns the value of a Property.
- Parameters:
prop: The property of which the value shall be returned.
- Returns:
out: The value of the property; if the value is an entity, its ID.
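The described behaviour can be sketched without the LinkAhead client. `ToyEntity`, `ToyProperty` and `get_value_sketch` are hypothetical names used only for illustration; the real function operates on LinkAhead Property and Entity objects.

```python
class ToyEntity:
    """Hypothetical stand-in for an entity that carries an id."""

    def __init__(self, id):
        self.id = id


class ToyProperty:
    """Hypothetical stand-in for a Property holding a value."""

    def __init__(self, value):
        self.value = value


def get_value_sketch(prop):
    """Return the property's value, or the id if the value is an entity."""
    if isinstance(prop.value, ToyEntity):
        return prop.value.id
    return prop.value


plain = get_value_sketch(ToyProperty(3.5))
reference = get_value_sketch(ToyProperty(ToyEntity(id=1234)))
```

Reducing a referenced entity to its ID makes values comparable, for example when checking whether two properties point to the same Record.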
- caosadvancedtools.crawler.separated(text)