---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: []
---

```{tags} tutorial, crawler
```

# Crawler Tutorial: Single structured file

:::{note}
This page has been migrated from the old documentation, and has not yet been fully revised. There might be inconsistencies or errors when using it with current LinkAhead versions.
:::

% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/83
% TODO: Archive documentation if for old crawler

In this tutorial, we will create a {term}`crawler` that reads a single structured file, such as a CSV file.

## Declarations

This tutorial is based on the following simple data {term}`schema`:

```{code-block} yaml
:caption: schema.yml

Fish:
  recommended_properties:
    date:
      datatype: DATETIME
    number:
      datatype: INTEGER
    weight:
      datatype: DOUBLE
    species:
      datatype: TEXT
```

You can insert this model with the following command:

```shell
python -m caosadvancedtools.models.parser schema.yml --sync
```

We will identify `Fish` {term}`Records` in LinkAhead using the following two attributes:

```{code-block} yaml
:caption: identifiables.yml

Fish:
  - date
  - number
```

And we will use the following crawler configuration:

```{code-block} yaml
:caption: cfood.yml

---
metadata:
  crawler-version: 0.9.1
---

fish_data_file:  # Root file
  type: CSVTableConverter
  match: "^fish_data_.*.csv$"  # Match CSV file with a name that starts with "fish_data_"
  subtree:
    table_row:  # One row in the CSV file
      type: DictElement
      match_name: .*  # we want to treat every row, so match anything
      match_value: .*
      records:
        Fish:  # Record for the current row; information from statements below
               # are added to this Record
      subtree:
        date:  # Element for the date column
          type: TextElement
          match_name: date  # Name of the column in the table file
          match_value: (?P<column_value>.*)  # We match any value of the row in this
                                             # column and assign it to the ``column_value``
                                             # variable
          records:  # Records edited for each cell
            Fish:
              date: $column_value
        species:
          type: TextElement
          match_name: species
          match_value: (?P<column_value>.*)
          records:
            Fish:
              species: $column_value
        number:
          type: TextElement
          match_name: identifier
          match_value: (?P<column_value>.*)
          records:
            Fish:
              number: $column_value
        weight:
          type: TextElement
          match_name: weight
          match_value: (?P<column_value>.*)
          records:
            Fish:
              weight: $column_value
```

## Python code

The following code allows us to read the CSV file, create corresponding `Fish` Records and synchronize those with LinkAhead.

```python
#!/usr/bin/env python3

# Copyright (C) 2023-2024 IndiScale GmbH
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.

"""Crawler for fish data"""

import os
import argparse
import sys
import logging

from caoscrawler.scanner import load_definition, create_converter_registry, scan_structure_elements
from caoscrawler.structure_elements import File
from caoscrawler import Crawler, SecurityMode
from caoscrawler.identifiable_adapters import CaosDBIdentifiableAdapter


def crawl_file(filename: str, dry_run: bool = False):
    """Read a CSV file into a LinkAhead container.

    Parameters
    ----------
    filename : str
      The name of the CSV file.

    dry_run : bool
      If True, do not modify the database.
    """
    # setup logging
    logger = logging.getLogger("caoscrawler")
    logger.setLevel(logging.DEBUG)
    logger.addHandler(logging.StreamHandler(stream=sys.stdout))

    # load crawler configuration
    definition = load_definition("cfood.yml")
    converter_registry = create_converter_registry(definition)

    # crawl the CSV file
    records = scan_structure_elements(items=File(name=os.path.basename(filename), path=filename),
                                      crawler_definition=definition,
                                      converter_registry=converter_registry)
    logger.debug(records)

    crawler = Crawler(securityMode=SecurityMode.UPDATE)
    # This defines how Records on the server are identified with the ones we have locally
    ident = CaosDBIdentifiableAdapter()
    ident.load_from_yaml_definition("identifiables.yml")
    crawler.identifiableAdapter = ident

    # Here we synchronize the data; with dry_run set, no changes are committed
    inserts, updates = crawler.synchronize(commit_changes=not dry_run, unique_names=True,
                                           crawled_data=records)

    # from IPython import embed
    # embed()


def _parse_arguments():
    """Parse the arguments."""
    parser = argparse.ArgumentParser(description='Crawler for fish data')
    parser.add_argument('-n', '--dry-run', help="Do not modify the database.", action="store_true")
    parser.add_argument('csv_file', metavar="csv file", help="The csv file to be crawled.")
    return parser.parse_args()


def main():
    """Main function."""
    args = _parse_arguments()
    crawl_file(args.csv_file, dry_run=args.dry_run)


if __name__ == '__main__':
    main()
```

## Running it

This is an example of a data file that we can crawl:

```{code-block} text
:caption: fish_data_1.csv

identifier,date,species,weight
1,2022-01-02,pike,3.4
2,2022-01-02,guppy,2.3
3,2022-01-02,pike,2.2
3,2022-01-06,pike,2.1
```

If you have created all the files, with the Python code saved as `crawl.py`, you can run:

```bash
python3 crawl.py fish_data_1.csv
```

Note that you can run the same script again without seeing any changes to the data in LinkAhead.

You may play around with changing data in the data table. Changes will propagate into LinkAhead when you run the Crawler again.
If you change one of the identifying {term}`properties`, the Crawler will consider the data that it reads as new and create new `Fish` Records.
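
The behaviour described above can be tried out without a LinkAhead server. The following is a minimal sketch in plain Python, not the crawler's actual implementation: a dictionary stands in for the database, and the helper `synchronize` is a hypothetical name introduced only for this illustration. It keys each row by the two identifying columns from `identifiables.yml` (the CSV column `identifier` corresponds to the `number` property) and reports which rows would be inserted and which would be updated.

```python
import csv
import io

# Identifying columns, mirroring identifiables.yml. The CSV column
# "identifier" is mapped to the property "number" in cfood.yml.
IDENTIFYING_COLUMNS = ("date", "identifier")


def synchronize(store, csv_text):
    """Classify rows as inserts or updates against ``store`` and apply them.

    ``store`` is a plain dict standing in for the database; this is an
    assumption made for illustration, not how the crawler stores data.
    """
    inserts, updates = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = tuple(row[column] for column in IDENTIFYING_COLUMNS)
        if key not in store:
            inserts.append(key)   # unknown identity: treated as new
        elif store[key] != row:
            updates.append(key)   # known identity, changed values
        store[key] = row
    return inserts, updates


DATA = """identifier,date,species,weight
1,2022-01-02,pike,3.4
2,2022-01-02,guppy,2.3
3,2022-01-02,pike,2.2
3,2022-01-06,pike,2.1
"""

store = {}
first_run = synchronize(store, DATA)
# Running again with the same data changes nothing.
second_run = synchronize(store, DATA)
# Changing a non-identifying value (the weight) results in an update.
third_run = synchronize(store, DATA.replace("guppy,2.3", "guppy,2.5"))
# Changing an identifying value (the date) results in a new entry.
fourth_run = synchronize(store, DATA.replace("2,2022-01-02", "2,2022-01-03"))

for name, result in [("first", first_run), ("second", second_run),
                     ("third", third_run), ("fourth", fourth_run)]:
    print(name, "inserts:", result[0], "updates:", result[1])
```

In the sketch, the entry with the old date stays in the dictionary after the fourth run, because nothing in the new data refers to it any more. The real Crawler performs the comparison against the server through the identifiable adapter, so this sketch only mirrors the decision between insert, update and no change.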