---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: []
---

```{tags} tutorial, crawler
```

# Crawler Tutorial: Single structured file

:::{note}
This page has been migrated from the old documentation, and has not yet been fully revised.
There might be inconsistencies or errors when using with current LinkAhead versions.
:::
% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/83
% TODO: Archive documentation if for old crawler

In this tutorial, we will create a {term}`crawler <Crawler>` that reads a single structured file,
such as a CSV file.

## Declarations

This tutorial is based on the following simple data {term}`schema <Schema>`:

```{code-block} yaml
:caption: schema.yml
Fish:
  recommended_properties:
    date:
      datatype: DATETIME
    number:
      datatype: INTEGER
    weight:
      datatype: DOUBLE
    species:
      datatype: TEXT
```

You can insert this model with the following command:

```shell
python -m caosadvancedtools.models.parser schema.yml --sync
```

We will identify `Fish` {term}`Records <Record>` in LinkAhead by the following two properties:

```{code-block} yaml
:caption: identifiables.yml
Fish:
   - date
   - number
```

And we will use the following crawler configuration:

```{code-block} yaml
:caption: cfood.yml
---
metadata:
  crawler-version: 0.9.1
---

fish_data_file:  # Root file
  type: CSVTableConverter
  match: '^fish_data_.*\.csv$'  # Match CSV files whose names start with "fish_data_"
  subtree:
    table_row:  # One row in the CSV file
      type: DictElement
      match_name: .* # we want to treat every row, so match anything
      match_value: .*
      records:
        Fish:  # Record for the current row; information from statements below
               # are added to this Record
      subtree:
        date:  # Element for the date column
          type: TextElement
          match_name: date  # Name of the column in the table file
          match_value: (?P<column_value>.*)  # We match any value of the row in this
                                             # column and assign it to the ``column_value``
                                             # variable
          records:  # Records edited for each cell
            Fish:
              date: $column_value
        species:
          type: TextElement
          match_name: species
          match_value: (?P<column_value>.*)
          records:
            Fish:
              species: $column_value
        number:
          type: TextElement
          match_name: identifier
          match_value: (?P<column_value>.*)
          records:
            Fish:
              number: $column_value
        weight:
          type: TextElement
          match_name: weight
          match_value: (?P<column_value>.*)
          records:
            Fish:
              weight: $column_value
```
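The two kinds of regular expressions in this configuration can be tried out with Python's `re` module. The following is only a sketch of the matching idea, not the crawler's internal implementation: the helper `capture_column_value` is hypothetical, and it is an assumption that the crawler applies the patterns in exactly this way.

```python
import re

# Pattern modelled on the ``match`` key of ``fish_data_file``;
# the dot before ``csv`` is escaped here so that it only matches a literal dot.
FILE_PATTERN = re.compile(r"^fish_data_.*\.csv$")

# Pattern from the ``match_value`` keys: a named group captures the cell content.
VALUE_PATTERN = re.compile(r"(?P<column_value>.*)")


def capture_column_value(cell: str) -> dict:
    """Return the named groups captured from one cell (hypothetical helper)."""
    match = VALUE_PATTERN.match(cell)
    return match.groupdict() if match else {}


candidates = ("fish_data_1.csv", "fish_data_1.csv.bak", "birds.csv")
file_matches = [name for name in candidates if FILE_PATTERN.match(name)]
variables = capture_column_value("pike")

print(file_matches)
print(variables)
```

The dictionary of named groups corresponds to the variables that the configuration refers to as `$column_value` when it fills the properties of the `Fish` Record.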

## Python code

The following script, saved as `crawl.py`, reads the CSV file, creates the corresponding `Fish`
Records, and synchronizes them with LinkAhead.

```python
#!/usr/bin/env python3

# Copyright (C) 2023-2024 IndiScale GmbH <info@indiscale.com>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.

"""Crawler for fish data"""

import os
import argparse
import sys
import logging

from caoscrawler.scanner import load_definition, create_converter_registry, scan_structure_elements
from caoscrawler.structure_elements import File
from caoscrawler import Crawler, SecurityMode
from caoscrawler.identifiable_adapters import CaosDBIdentifiableAdapter


def crawl_file(filename: str, dry_run: bool = False):
    """Read a CSV file into a LinkAhead container.

    Parameters
    ----------
    filename : str
        The name of the CSV file.

    dry_run : bool
        If True, do not modify the database.
    """
    # setup logging
    logger = logging.getLogger("caoscrawler")
    logger.setLevel(level=logging.DEBUG)
    logger.addHandler(logging.StreamHandler(stream=sys.stdout))

    # load crawler configuration
    definition = load_definition("cfood.yml")
    converter_registry = create_converter_registry(definition)

    # crawl the CSV file
    records = scan_structure_elements(items=File(name=os.path.basename(filename), path=filename),
                                      crawler_definition=definition,
                                      converter_registry=converter_registry)
    logger.debug(records)

    crawler = Crawler(securityMode=SecurityMode.UPDATE)
    # This defines how Records on the server are identified with the ones we have locally
    ident = CaosDBIdentifiableAdapter()
    ident.load_from_yaml_definition("identifiables.yml")
    crawler.identifiableAdapter = ident

    # Here we synchronize the data; in a dry run, no changes are committed to the server
    inserts, updates = crawler.synchronize(commit_changes=not dry_run, unique_names=True,
                                           crawled_data=records)


def _parse_arguments():
    """Parse the arguments."""
    parser = argparse.ArgumentParser(description='Crawler for fish data')
    parser.add_argument('-n', '--dry-run', help="Do not modify the database.", action="store_true")
    parser.add_argument('csv_file', metavar="csv file", help="The CSV file to be crawled.")
    return parser.parse_args()


def main():
    """Main function."""
    args = _parse_arguments()
    crawl_file(args.csv_file, dry_run=args.dry_run)


if __name__ == '__main__':
    main()
```

## Running it

This is an example of a data file that we can crawl:

```{code-block} text
:caption: fish_data_1.csv
identifier,date,species,weight
1,2022-01-02,pike,3.4
2,2022-01-02,guppy,2.3
3,2022-01-02,pike,2.2
3,2022-01-06,pike,2.1
```
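To see how the columns line up with the schema, the following sketch parses the same table with the standard library only. It does not use the crawler: `row_to_fish` is a hypothetical helper, and representing the `DATETIME` property as a plain `datetime.date` is a simplifying assumption.

```python
import csv
import io
from datetime import date

CSV_TEXT = """identifier,date,species,weight
1,2022-01-02,pike,3.4
2,2022-01-02,guppy,2.3
3,2022-01-02,pike,2.2
3,2022-01-06,pike,2.1
"""


def row_to_fish(row: dict) -> dict:
    """Convert one CSV row into typed ``Fish`` properties (hypothetical helper)."""
    return {
        "number": int(row["identifier"]),  # the ``identifier`` column feeds ``number``
        "date": date.fromisoformat(row["date"]),
        "species": row["species"],
        "weight": float(row["weight"]),
    }


fish = [row_to_fish(row) for row in csv.DictReader(io.StringIO(CSV_TEXT))]
for entry in fish:
    print(entry)
```

Note that the last two rows share the number 3 but differ in their date, so they describe two distinct `Fish` Records under the identifiable definition above.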

If you have created all the files, you can run:

```bash
python3 crawl.py fish_data_1.csv
```

Note that you can run the same script again without any changes being made to the data in
LinkAhead.

You can experiment by changing values in the data table. Changes propagate into LinkAhead when
you run the crawler again. If you change one of the identifying {term}`properties <Property>`
(`date` or `number`), the crawler treats the row as new data and creates a new `Fish` Record.
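The behavior described above can be illustrated with a toy model in plain Python. This is a minimal sketch under stated assumptions, not how the crawler or the server is implemented: the dictionary `store` merely stands in for LinkAhead, and the functions `identity` and `synchronize` are hypothetical.

```python
def identity(fish: dict) -> tuple:
    """Key built from the identifying properties ``date`` and ``number``."""
    return (fish["date"], fish["number"])


def synchronize(store: dict, crawled: list) -> tuple:
    """Toy synchronization: return the lists of inserted and updated records."""
    inserts, updates = [], []
    for fish in crawled:
        key = identity(fish)
        if key not in store:
            inserts.append(fish)   # unknown identity: treated as a new record
        elif store[key] != fish:
            updates.append(fish)   # known identity, but a property changed
        store[key] = dict(fish)
    return inserts, updates


store = {}  # stands in for the LinkAhead server
pike = {"date": "2022-01-02", "number": 1, "species": "pike", "weight": 3.4}

first = synchronize(store, [pike])                             # new record
second = synchronize(store, [pike])                            # nothing changes
third = synchronize(store, [{**pike, "weight": 3.6}])          # same identity, new weight
fourth = synchronize(store, [{**pike, "date": "2022-01-03"}])  # new identity

print(len(first[0]), len(second[0]) + len(second[1]), len(third[1]), len(fourth[0]))
```

In this model, the second run reports neither inserts nor updates, a changed weight yields one update, and a changed date yields one insert, so the store ends up with two records.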