---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: [ ]
---

```{tags} tutorial, advanced-user, schema, jsonschema
```

# Conversion between LinkAhead Schema, JSON Schema, and XLSX (and vice versa)

:::{note}
This page has been migrated from the old documentation, and has not yet been fully revised.
There might be inconsistencies or errors when using with current LinkAhead versions.
:::
% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/77

This file describes the conversion between JSON schema files and XLSX templates, and between JSON
data files following a given schema and XLSX files with data. This conversion is handled by the
Python modules in the `table_json_conversion` library.

## LinkAhead Schema in JSON Schema and JSON data

Let’s start with a simple {term}`Schema`! If you describe a `Person` {term}`Record` with the
{term}`Properties <Property>` `family_name` "Steve" and `given_name` "Stevie" in JSON, it looks like
this:

```json
{
    "Person":
        {
            "family_name": "Steve",
            "given_name": "Stevie"
        }
}
```

The {term}`Schema` in LinkAhead defines the types of {term}`Record`s present in a LinkAhead instance
and their structure. A LinkAhead Schema can also be represented as a JSON Schema.

You can define this kind of structure with the following JSON schema:

```json
{
  "type": "object",
  "properties": {
    "Person": {
      "type": "object",
      "properties": {
        "family_name": {
          "type": "string"
        },
        "given_name": {
          "type": "string"
        }
      }
    }
  },
  "$schema": "https://json-schema.org/draft/2020-12/schema"
}
```

In general, the above schema, and any schema created by `json_schema_exporter.merge_schemas(...)`), 
describe RecordType relations and properties in dictionary form. This form is sufficient to describe
a Schema, however, real data often consists of multiple entries of the same type, for example
multiple Persons, which cannot be done using the schema shown above.

For example, the following JSON describes two “Person” Records:

```json
{
    "Person": [
        {
            "family_name": "Steve",
            "given_name": "Stevie"
        },
        {
            "family_name": "Star",
            "given_name": "Stella"
        }
    ]
}
```

The *JSON Schema* for a JSON like the above one could look like the following:

```json
{
  "type": "object",
  "properties": {
    "Person": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "family_name": {
            "type": "string"
          },
          "given_name": {
            "type": "string"
          }
        }
      }
    }
  },
  "$schema": "https://json-schema.org/draft/2020-12/schema"
}
```

This would define that the top level object/dict may have a key `Person` which has as value an array
of objects that in turn have the properties `family_name` and `given_name`.

You can create a data array schema from a Schema using `xlsx_utils.array_schema_from_model_schema`.

## From JSON to XLSX: Data Representation

The following describes how JSON files representing LinkAhead records are converted into XLSX files,
or how JSON files with records are created from XLSX files.

The attribute name (e.g., “Person” above) determines the {term}`RecordType`, and the value of this
attribute can either be an object or a list. If it is an object (as in the example above), a single
record is represented. In the case of a list, multiple records sharing the same RecordType as the
parent are represented.

The *Properties* of the record (e.g., `family_name` and `given_name` above) become *columns* in the
XLSX file. Thus, the XLSX file created from the above example would have a sheet “Person” with the
following table:

| given_name | family_name |
| ---------- | ----------- |
| Stevie     | Steve       |
| Stella     | Star        |

The properties of objects (Records) in the JSON have an attribute name and a value. The value can
be:

1. A primitive (text, number, boolean, …)
2. A record
3. A list of primitive types
4. A list of unique enums (multiple choice)
5. A list of records

In cases *a.* and *c.*, a cell is created in the column corresponding to the property in the XLSX
file. In case *b.*, columns are created for the Properties of the record, where for each of the
Properties the cases *a.* - *e.* are considered recursively. Case *d.* leads to a number of columns,
one for each of the possible choices.

For case *e.* however, the two-dimensional structure of an XLSX sheet is not sufficient. Therefore,
for such cases, *new* XLSX sheets/tables are created.

In these sheets/tables, the referenced records are treated as described above (new columns for the
Properties). However, there are now additional columns that indicate from which “external” record
these records are referenced.

Let’s now consider these five cases in detail and with examples:

### a. Properties with primitive data types

```json
{
    "Training": [
      {
        "date": "2023-01-01",
        "url": "www.indiscale.com",
        "duration": 1.0,
        "participants": 1,
        "remote": false
      },
      {
        "date": "2023-06-15",
        "url": "www.indiscale.com/next",
        "duration": 2.5,
        "participants": null,
        "remote": true
      }
    ]
}
```

This entry will be represented in an XLSX sheet with the following content:

| date       | url                    | duration | participants | remote |
| ---------- | ---------------------- | -------- | ------------ | ------ |
| 2023-01-01 | www.indiscale.com      | 1.0      | 1            | false  |
| 2023-06-15 | www.indiscale.com/next | 2.5      |              | true   |

### b. Property referencing a record

```json
{
    "Training": [
      {
        "date": "2023-01-01",
        "supervisor": {
            "family_name": "Stevenson",
            "given_name": "Stevie"
        }
      }
    ]
}
```

This entry will be represented in an XLSX sheet named "Training" with the following content:

| date       | supervisor.family_name | supervisor.given_name |
| ---------- | ---------------------- | --------------------- |
| 2023-01-01 | Stevenson              | Stevie                |

### c. Properties containing lists of primitive data types

```json
{
    "Training": [
      {
        "url": "www.indiscale.com",
        "subjects": ["Math", "Physics"],
      }
    ]
}
```

This entry would be represented in an XLSX sheet with the following content:

| url               | subjects     |
| ----------------- | ------------ |
| www.indiscale.com | Math;Physics |

The list elements are written into the cell separated by a semicolon (`;`). If the elements contain
the separator `;`, it is escaped with `\`.

### d. Multiple choice properties

```json
{
    "Training": [
      {
        "date": "2024-04-17",
        "skills": [
              "Planning",
              "Evaluation"
        ]
      }
    ]
}
```

If the `skills` list is denoted as an `enum` array with `"uniqueItems": true` in the json schema,
this entry would be represented like this in an XLSX:

| date       | skills.Planning | skills.Communication | skills.Evaluation |
| ---------- | --------------- | -------------------- | ----------------- |
| 2024-04-17 | x               |                      | x                 |

Note that this example assumes that the list of possible choices, as given in the json schema, was
“Planning, Communication, Evaluation”.

### e. Properties containing lists with references

```json
{
    "Training": [
      {
        "date": "2023-01-01",
        "coach": [
            {
              "family_name": "Sky",
              "given_name": "Max",
            },
            {
              "family_name": "Sky",
              "given_name": "Min",
            }
        ]
      }
    ]
}
```

Since the two coaches cannot be represented properly in a single cell, another worksheet is needed
to contain the properties of the coaches.

The sheet for the Trainings in this example only contains the “date” column

| date       |
| ---------- |
| 2023-01-01 |

Additionally, there is *another* sheet where the coaches are stored. Here, it is crucial to define
how the correct element is chosen from potentially multiple “Trainings”. In this case, it means that
the “date” must be unique.

The second sheet looks like this:

| date       | `coach.family_name` | `coach.given_name` |
| ---------- | ------------------- | ------------------ |
| 2023-01-01 | Sky                 | Max                |
| 2023-01-01 | Sky                 | Min                |

Note: This uniqueness requirement is not strictly checked right now, it is your responsibility as a
user that such “foreign properties” are truly unique.

When converting JSON files which contain Records that were exported from LinkAhead, it can be a good
idea to use the LinkAhead ID as a unique identifier for Records. However, if your Records do not yet
have LinkAhead IDs, you need to define another identifying properties/foreign keys. Note that these
properties only need to identify a Record uniquely within the list of Records: In the above example
the "coach" Record needs to be identified in the list of coaches.

## Data in XLSX: Hidden automation logic

### First column: Marker for row types

The first column in each sheet is used to annotate special rows with metadata, and is hidden by
default. The following values are used:

- `IGNORE`: This row is ignored. It can be used for explanatory texts or layout.
- `COL_TYPE`: Typically the first row that is not `IGNORE`. It indicates the row that defines the
  type of columns (`FOREIGN`, `SCALAR`, `LIST`, `MULTIPLE_CHOICE`, `IGNORE`). This row must occur
  exactly once per sheet.
- `PATH`: Indicates that the row is used to define the path within the JSON. These rows are
  typically hidden for users.

An example table could look like this:

| IGNORE   |                                     | Welcome        | to            | this         | file                        |
| -------- | ----------------------------------- | -------------- | ------------- | ------------ | --------------------------- |
| IGNORE   |                                     | Please         | enter your    | data here:   |                             |
| COL_TYPE | IGNORE                              | SCALAR         | SCALAR        | LIST         | SCALAR                      |
| PATH     |                                     | Training       | Training      | Training     | Training                    |
| PATH     |                                     | url            | date          | subjects     | supervisor                  |
| PATH     |                                     |                |               |              | email                       |
| IGNORE   | Please enter one training per line. | Training URL   | Training date | Subjects     | Supervisor's email          |
|          |                                     | example.com/mp | 2024-02-27    | Math;Physics | <mailto:steve@example.com>  |
|          |                                     | example.com/m  | 2024-02-28    | Math         | <mailto:stella@example.com> |

### Parsing XLSX data

To extract the value of a given cell, we traverse all path elements (in `PATH` rows) from top to
bottom. The final element of the path is the name of the Property to which the value belongs. In the
example above, `steve@example.com` is the value of the `email` Property in the path
`["Training", "supervisor", "email"]`.

The path elements are sufficient to identify the object within a JSON, at least if the corresponding
JSON element is a single object. If the JSON element is an array, the appropriate object within the
array needs to be selected.

For this selection additional `FOREIGN` columns are used. The paths in these columns must all have
the same *base* and one additional *unique key* component. For example, two `FOREIGN` columns could
be `["Training", "date"]` and `["Training", "url"]`, where `["Training"]` is the *base path* and
`"date"` and `"url"` are the *unique keys*.

The base path defines the table (or recordtype) to which the entries belong, and the values of the
unique keys define the actual rows to which data belongs.

For example, this table defines three coaches for the two trainings from the last table:

```{eval-rst}
+----------+-----------------------+-----------------------+------------------------+
| COL_TYPE | FOREIGN               | FOREIGN               | SCALAR                 |
+----------+-----------------------+-----------------------+------------------------+
| PATH     | Training              | Training              | Training               |
+----------+-----------------------+-----------------------+------------------------+
| PATH     | date                  | url                   | coach                  |
+----------+-----------------------+-----------------------+------------------------+
| PATH     |                       |                       | given_name             |
+----------+-----------------------+-----------------------+------------------------+
| IGNORE   | Date of training      | URL of training       | The coach’s given name |
+----------+-----------------------+-----------------------+------------------------+
| IGNORE   | from sheet ‘Training’ | from sheet ‘Training’ |                        |
+----------+-----------------------+-----------------------+------------------------+
|          | 2024-02-27            | example.com/mp        | Ada                    |
+----------+-----------------------+-----------------------+------------------------+
|          | 2024-02-27            | example.com/mp        | Berta                  |
+----------+-----------------------+-----------------------+------------------------+
|          | 2024-02-28            | example.com/m         | Chris                  |
+----------+-----------------------+-----------------------+------------------------+
```

#### Special case: multiple choice “checkboxes”

As a special case, enum arrays with `"uniqueItems": true` can be represented as multiple columns,
with one column per choice. The choices are denoted as the last PATH component, the column type must
be MULTIPLE_CHOICE.

Stored data is denoted as an “x” character in the respective cell, empty cells denote that the item
was not selected. Additionally, the implementation also allows TRUE or 1 for selected items, and
FALSE, 0 or cells with only whitespace characters for deselected items:

```{eval-rst}
+----------+-----------------+----------------------+-------------------+
| COL_TYPE | MULTIPLE_CHOICE | MULTIPLE_CHOICE      | MULTIPLE_CHOICE   |
+----------+-----------------+----------------------+-------------------+
| PATH     | skills          | skills               | skills            |
+----------+-----------------+----------------------+-------------------+
| PATH     | Planning        | Communication        | Evaluation        |
+----------+-----------------+----------------------+-------------------+
| IGNORE   | skills.Planning | skills.Communication | skills.Evaluation |
+----------+-----------------+----------------------+-------------------+
|          | x               |                      | X                 |
+----------+-----------------+----------------------+-------------------+
|          | "  "            | TRUE                 | FALSE             |
+----------+-----------------+----------------------+-------------------+
|          | 0               | x                    | 1                 |
+----------+-----------------+----------------------+-------------------+
```

These rows correspond to:

1. Planning, Evaluation
2. Communication
3. Communication, Evaluation

## User Interaction

The primary and most straight forward use case of this utility is to export LinkAhead data as JSON
and then as XLSX tables. This can be done fully automatic.

% TODO: show how!

The hidden cells for automation are designed such that the XLSX template that is created can be
customized such that it is a nicely formatted table. The hidden content must remain. See below for
tips on how to manipulate the table.

The second use case is to use XLSX to collect data and then import it into LinkAhead. Here, it may
be necessary to define foreign keys in order to identify Records in lists.

### Table Manipulation

- All formatting is ignored
- Nothing has to be observed when adding new data rows
- When adding new descriptory rows (for example one for descriptions of the columns), the `COL_TYPE`
  must be set to `IGNORE`
- You can freely rename sheets.
- You can freely rename columns (since the row containing the column names is set to `IGNROE`; the
  Property name is taken from the last path element)
- You can change the order of columns. However, you have to make sure to move the full column
  including hidden elements. Thus, you should not select a range of cells, but click on the column
  index in your spreadsheet program.

## Note: Requirements

This conversion does not allow arbitrary JSON schema files nor does it support arbitrary JSON files
since conversion to XLSX files would not make sense. Instead, this utility is tailored to supported
conversion of data (and Schema) that are structured like data (and Schema) in LinkAhead:

- The JSON Schema describes a Schema of RecordTypes and Properties as it would be generated by the
  `caosadvancedtools.json_schema_exporter` module.
- The JSON files must contain arrays of Records complying with such a Schema.

Thus, when converting from a JSON schema, the top level of the JSON schema must be a dict. The keys
of the dict are RecordType names.

% TODO: ## Current limitations
% TODO: The current implementation still lacks the following:
% TODO: - Files handling is not implemented yet.