---
title: "Getting Started with sumer"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Getting Started with sumer}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "ragg_png"
)
```

## 1. Introduction

The `sumer` package provides tools for translating and analyzing Sumerian cuneiform texts. It converts between different text representations, offers dictionary lookup, and includes an interactive translation tool.

Modern scholars typically work with Sumerian texts in **transliterated** form -- a phonetic rendering in Latin characters. For example, the Sumerian word for king is transliterated as `lugal`. However, translation does not depend on pronunciation. The meaning of a sign depends on the sign itself, not on how it is read aloud. Even when there is reason to believe that words with similar pronunciations have similar meanings, dictionaries can be based solely on the cuneiform characters. The same cuneiform sign can have several different readings (transliterations), but it always has exactly one sign name and one cuneiform character.

```{r}
library(sumer)
```


## 2. Representations of Cuneiform Signs

Each cuneiform sign has three representations:

| Representation    | Example             | Description                                        |
|-------------------|---------------------|----------------------------------------------------|
| Transliteration   | `lugal`             | Phonetic transcription in lowercase letters        |
| Sign name         | `LUGAL`             | Canonical name in uppercase letters                |
| Cuneiform         | &#x12217;           | Unicode character (U+12000 to U+12500)             |

The package works internally with cuneiform characters and sign names. Transliteration serves as a convenient input method.

**Note on display:** Cuneiform characters require a font that covers the Unicode cuneiform code points in the range U+12000--U+12500. In RStudio, the AGG graphics backend should be enabled (Tools > Global Options > General > Graphics > Backend > AGG).
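To check whether a character falls into this range, base R can convert between a character and its code point. The sketch below uses only `utf8ToInt()` and `sprintf()`; the helper `in_cuneiform_range_sketch()` is a made-up name for this illustration and is not part of `sumer`.

```{r}
# Code point of the sign LUGAL, written with a Unicode escape
lugal_cp <- utf8ToInt("\U00012217")
sprintf("U+%X", lugal_cp)

# Hypothetical helper: is a single character inside the range quoted above?
in_cuneiform_range_sketch <- function(ch) {
  cp <- utf8ToInt(ch)
  length(cp) == 1 && cp >= 0x12000 && cp <= 0x12500
}

in_cuneiform_range_sketch("\U00012217")  # a cuneiform sign
in_cuneiform_range_sketch("a")           # a Latin letter
```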


### 2.1 Retrieving sign information

The function `info()` retrieves all available information about a sign or sign sequence:

```{r}
info("lugal")
```

For compound expressions, all contained signs are analyzed:

```{r}
info("d-en-lil2")
```

Each sign (d, en, lil2) is shown with its sign name (AN, EN, KID) and cuneiform character. The `alternatives` column lists all possible readings -- for instance, EN can also be read as `ru12` or `uru16`.


### 2.2 Conversion between representations

Two functions convert entire texts:

```{r}
# Transliteration -> Cuneiform
as.cuneiform("lugal-e")
as.cuneiform(c("d-en-lil2", "an-ki"))

# Transliteration -> Sign names
as.sign_name("lugal-e")
as.sign_name(c("d-en-lil2", "an-ki"))
```

Within a word, hyphens (`-`) separate syllables; dots (`.`) separate sign names; spaces separate words.
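These separator conventions can be illustrated with plain string splitting. This is a minimal base R sketch, not the parser used by the package; `split_translit_sketch()` is a hypothetical helper name.

```{r}
# Hypothetical helper: split a transliterated line into words and syllables
split_translit_sketch <- function(x) {
  words <- strsplit(trimws(x), " +")[[1]]            # spaces separate words
  lapply(words, function(w) {
    strsplit(w, "-", fixed = TRUE)[[1]]              # hyphens separate syllables
  })
}

split_translit_sketch("d-en-lil2 an-ki")

# Sign names are separated by dots instead of hyphens
strsplit("AN.EN.KID", ".", fixed = TRUE)[[1]]
```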


## 3. Dictionary Lookup

### 3.1 Loading a dictionary

The package includes a built-in dictionary:

```{r}
dic <- read_dictionary()
```

The vignette "Translating Sumerian Texts" describes how you can create your own dictionary.

### 3.2 Forward lookup: Sumerian -> English

```{r}
look_up("lugal", dic)
```

The output shows the sign name, cuneiform character, translations with frequency counts and grammatical types, and entries for individual signs and substrings. For compound expressions, all partial combinations are looked up as well:

```{r}
look_up("d-suen", dic)
```

### 3.3 Reverse lookup: English -> Sumerian

To find the Sumerian sign for an English term, use `lang = "en"`:

```{r}
look_up("Enki", dic, lang = "en")
```

The reverse lookup searches all translations and displays matching entries with their sign names and cuneiform characters.


## 4. The Type System

Each dictionary entry has a **grammatical type** in addition to its translation. These types describe the function of a sign in a sentence. Since the same sign can serve different functions depending on context, it may have multiple entries with different types.

### 4.1 Basic types

There are three basic types:

| Type    | Name        | Description                                     | Example                  |
|---------|-------------|-------------------------------------------------|--------------------------|
| **S**   | Substantive | Noun phrases and substantives                   | "king", "Earth"          |
| **V**   | Verb        | Verbs and verbal expressions                    | "create", "go"           |
| **A**   | Attribute   | Modifying clauses                               | "who is strong"          |

You can see the different types of a sign with `look_up()`:

```{r}
look_up("an", dic)
```

The sign AN appears both as a noun (S: "sky/heaven") and as an operator that transforms other expressions -- which brings us to the next topic.


### 4.2 Operators

In addition to basic types, there are **operators**. An operator takes one or two expressions of a certain type as arguments and produces an expression of a (possibly different) type. The notation describes where the arguments stand and what type is produced:

```{r, echo = FALSE}
op_df <- data.frame(
  Notation    = c("Sx->V", "xS->S", "Sx->A", "Sx->S", "xV->V", "Vx->V", "SSx->V"),
  Alternative = c("S\u2612\u2192V", "\u2612S\u2192S", "S\u2612\u2192A", "S\u2612\u2192S",
                   "\u2612V\u2192V", "V\u2612\u2192V", "SS\u2612\u2192V"),
  Meaning     = c("Takes an S to the left, produces a V",
                   "Takes an S to the right, produces an S",
                   "Takes an S to the left, produces an A",
                   "Takes an S to the left, produces an S",
                   "Takes a V to the right, produces a V",
                   "Takes a V to the left, produces a V",
                   "Takes two S to the left, produces a V")
)
knitr::kable(op_df, col.names = c("Notation", "Alternative notation", "Meaning"))
```

The `x` marks the position of the operator itself. In `Sx->V`, the argument S stands to the left of the operator; in `xS->S`, the argument stands to the right. Some operators with two arguments (like `SSx->V`) take both arguments from the same side. The symbols `x` and `->` have the alternative Unicode representations `☒` and `→` that may appear in dictionary entries and line files.

In translations, the placeholder **S** (or **V**) stands for the argument. For example, an operator `xS->S` with the translation "community of S" means: take the noun to the right of this sign and insert it where the S placeholder stands. For operators with two arguments of the same type, the placeholders are numbered: `S1` and `S2`.

Let us trace through a concrete example. Consider the expression `un-ma-gi` from "Enki and the World Order" (line 16), which consists of three signs:

```{r, echo = FALSE}
ex_df <- data.frame(
  Syllable  = c("un", "ma", "gi"),
  Sign      = c("\U00012327", "\U00012220", "\U00012100"),
  Type      = c("\u2612S\u2192S", "S", "S\u2612\u2192S"),
  Translation = c("community of S", "container", "the permanent S")
)
knitr::kable(ex_df, col.names = c("Syllable", "Cuneiform Sign", "Type", "Translation"))
```

We build up the noun phrase step by step:

1. **𒈠** is a simple noun (S): "container".
2. **𒄀** is an operator `S☒→S` that takes the S to its left. It binds 𒈠 and replaces the placeholder: "the permanent **container**". The result is again an S.
3. **𒌧** is an operator `☒S→S` that takes the S to its right. It binds the result from step 2: "community of **the permanent container**". The result is again an S.

Note the binding order: operators that take their argument from the left (such as 𒄀) bind more strongly than operators that take their argument from the right (such as 𒌧), so 𒄀 is applied first. The final noun phrase is: "community of the permanent container" (type S).

In this context, "community" refers to the people and "permanent container" refers to the land of Sumer. The expression `un-ma-gi` thus means "the people of Sumer". In the dictionary, such context-dependent meanings are recorded using the notation **"literal meaning {specific meaning}"**. The dictionary entry for this expression would be: "community of the permanent container {people of Sumer}". The literal meaning documents the compositional structure, while the specific meaning in curly braces gives a contextual interpretation.
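The placeholder substitution in the three steps above can be sketched in base R. This only illustrates the idea and makes no claim about how the package implements it; `fill_placeholder_sketch()` is a hypothetical helper.

```{r}
# Hypothetical helper: insert an argument at the placeholder of an operator
# translation. The placeholder is matched as a whole word, so the "S" in a
# word such as "Sumer" is left alone. Arguments containing backslashes are
# not handled by this sketch.
fill_placeholder_sketch <- function(op_translation, argument, placeholder = "S") {
  sub(paste0("\\b", placeholder, "\\b"), argument, op_translation, perl = TRUE)
}

step1 <- "container"                                        # ma: plain S
step2 <- fill_placeholder_sketch("the permanent S", step1)  # gi takes the S to its left
step3 <- fill_placeholder_sketch("community of S", step2)   # un takes the S to its right
step3
```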


### 4.3 Verb types

A simple verb (type **V**) stands on its own -- for example, "to be used as a resource". It combines directly with a subject noun phrase (S + V -> SEN) to form a sentence.

Many Sumerian verbs, however, take a noun phrase as their object. The type **Vt** describes such a **transitive verb**: it takes an S as its object and produces a complex intransitive verb (V). The translation contains an S placeholder for the object, for example "to equip S". The resulting V must then be combined with a subject (S) to form a complete sentence (SEN). `Vt` is a generalization of `Sx->V` that also works correctly when the verb has prefixes or suffixes.

In Sumerian, verbs are often preceded by **verb prefixes** -- signs that modify the verb's meaning (expressing modality, aspect, or other nuances). A verb prefix has the type `☒V→V`: it takes a verb to its right and produces a modified verb. Since each prefix produces a V, multiple prefixes can chain together. They bind from right to left, wrapping around the core verb like layers. Conversely, a **verb suffix** has the type `V☒→V` and binds from left to right. Prefixes and suffixes can co-occur.

Consider the verb chain `gan-ig-la` from line 8 of "Enki and the World Order":

```{r, echo = FALSE}
vt_df <- data.frame(
  Syllable   = c("gan", "ig", "la"),
  Sign       = c("\U000120f6", "\U00012145", "\U000121b7"),
  Type      = c("\u2612V\u2192V", "\u2612V\u2192V", "Vt"),
  Translation = c("may V",
                   "V with the task of establishing sustenance of human existence",
                   "to equip S")
)
knitr::kable(vt_df, col.names = c("Syllable", "Cuneiform Sign", "Type", "Translation"))
```

The verb builds up from the core outward:

```
𒃶          𒅅                         𒆷
☒V→V         ☒V→V                        Vt
"may V"      "V with the task ..."        "to equip S"
              └────────────────────────────┘
              𒅅(𒆷) = "equip S with the task ..."
   └──────────────────────┘
𒃶(𒅅(𒆷)) = "may equip S with the task ..."
```

𒆷 (`Vt`: "to equip S") is the core verb. 𒅅 (`☒V→V`) wraps it with additional meaning. 𒃶 (`☒V→V`) adds modality. The final composed verb is: "may equip S with the task of establishing sustenance of human existence" (`Vt`). This complex verb still has its S placeholder -- it will be filled when the verb meets its object during sentence composition.
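The same right-to-left wrapping can be written as a small loop. The sketch assumes that the leading "to " of the core verb is dropped before it is inserted at the V placeholder, which matches the composed strings shown above but is not confirmed as the package's rule. The variable names are made up for this illustration.

```{r}
# Translations from the table above: prefixes in text order, then the core verb
toy_prefixes <- c(
  "may V",
  "V with the task of establishing sustenance of human existence"
)
toy_core <- "to equip S"

# Assumption: strip the leading "to " before inserting the verb
toy_verb <- sub("^to ", "", toy_core)

# Prefixes bind from right to left, so the innermost prefix is applied first
for (p in rev(toy_prefixes)) {
  toy_verb <- sub("\\bV\\b", toy_verb, p, perl = TRUE)
}
toy_verb
```

The S placeholder survives all substitutions, which reflects that the composed verb is still of type `Vt`.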


### 4.4 Composition rules

When two expressions stand side by side and neither is an operator, they are combined by **composition rules**:

| Left | Right | Result | Translation pattern                          |
|------|-------|--------|----------------------------------------------|
| S    | S     | S      | "X of/with Y"                                |
| S    | A     | S      | "X Y" (juxtaposition)                        |
| S    | V     | SEN    | "X Y" (subject + verb, "to" stripped)        |
| SEN  | SEN   | SEN    | "X. Y" (sentences joined with period)        |

The type **SEN** stands for a complete sentence. It arises when a noun phrase (S) meets a verb (V). For example, if "separated groups" (S) is followed by "to be created" (V), the composition strips the "to" and produces the sentence "Separated groups be created." (SEN). This raw composition must then be finished by hand -- in this case, adjusting the verb form to produce "Separated groups are created."
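The table can be read as a lookup from a pair of types to a result type. The following base R sketch encodes it that way and reproduces the raw S + V composition from the example. Both helper functions are hypothetical and only approximate what the text describes.

```{r}
# The composition table above as a lookup (result type only)
toy_rules <- data.frame(
  left   = c("S", "S", "S",   "SEN"),
  right  = c("S", "A", "V",   "SEN"),
  result = c("S", "S", "SEN", "SEN"),
  stringsAsFactors = FALSE
)

compose_type_sketch <- function(left, right) {
  hit <- toy_rules$result[toy_rules$left == left & toy_rules$right == right]
  if (length(hit) == 0) NA_character_ else hit
}

# S + V -> SEN: strip "to ", capitalise the first letter, add a period
compose_sentence_sketch <- function(subject, verb) {
  raw <- paste(subject, sub("^to ", "", verb))
  paste0(toupper(substring(raw, 1, 1)), substring(raw, 2), ".")
}

compose_type_sketch("S", "V")
compose_sentence_sketch("separated groups", "to be created")
```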

These rules, together with the operator types from the previous sections, are sufficient to translate entire Sumerian sentences from their individual parts. Since signs in Old Sumerian texts encode phrases rather than syllables, this implies that the Old Sumerian language is actually not a natural language. It is a formal language that can be pronounced in any natural language that is sufficiently complex.


## 5. Text Analysis

### 5.1 N-gram analysis

A good first step when working with a new text is to search for frequently recurring sign combinations (n-grams). Such patterns are valuable clues: if a certain sequence of cuneiform signs appears repeatedly, it is likely a fixed term, a compound word, or an idiomatic expression.

The package includes the example text "Enki and the World Order":

```{r}
path <- system.file("extdata", "project", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding = "UTF-8")
```

The function `ngram_frequencies()` finds recurring combinations:

```{r}
freq <- ngram_frequencies(text, min_freq = c(6, 4, 2))
head(freq, 10)
```

The `min_freq` parameter controls the minimum frequency for different n-gram lengths. The default value `c(6, 4, 2)` means: single signs must occur at least 6 times, pairs at least 4 times, and all longer combinations at least 2 times.

The analysis works from the longest combinations down to the shortest. When a long combination is identified as frequent, its occurrences are masked so that shorter sub-combinations are not counted as frequent merely because they are part of the longer combination.
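The basic counting step can be pictured with a few lines of base R. The sketch below counts n-grams of a single length in a made-up sequence with placeholder sign names; it leaves out the masking step and is not the algorithm behind `ngram_frequencies()`.

```{r}
# Hypothetical helper: count all n-grams of length n in a vector of sign names
count_ngrams_sketch <- function(signs, n) {
  if (length(signs) < n) return(table(character(0)))
  starts <- seq_len(length(signs) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(signs[i:(i + n - 1)], collapse = "."),
    character(1)
  )
  sort(table(grams), decreasing = TRUE)
}

# Made-up sequence with placeholder sign names A, B, C
toy_signs  <- c("A", "B", "A", "B", "C", "A", "B")
toy_counts <- count_ngrams_sketch(toy_signs, 2)
toy_counts[toy_counts >= 2]   # keep pairs that occur at least twice
```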

With `mark_ngrams()`, the identified patterns are marked in the text with curly braces:

```{r}
text_marked <- mark_ngrams(text, freq)
cat(text_marked[1:5], sep = "\n")
```

You can also search for a specific pattern in the annotated text:

```{r}
term    <- "IGI.DIB.TU"
pattern <- mark_ngrams(term, freq)
pattern
result  <- text_marked[grepl(pattern, text_marked, fixed = TRUE)]
cat(result, sep = "\n")
```


### 5.2 Grammar probabilities

To understand the structure of a sentence, it is helpful to know which grammatical role each individual sign is likely to play. The function `sign_grammar()` looks up each sign of a string in the dictionary and counts how often it occurs with each grammatical type:

```{r}
sg <- sign_grammar("a-ma-ru ba-ur3 ra", dic)
```

The raw frequencies can be refined into probabilities using a Bayesian model. First, compute the prior distribution of types across all signs in the dictionary:

```{r}
prior <- prior_probs(dic, sentence_prob = 0.25)
```

The `sentence_prob` parameter corrects a systematic bias: if a dictionary was primarily built from noun phrases (rather than complete sentences), verbs are underrepresented. A value of 0.25 means that an estimated 25% of the dictionary entries come from complete sentences. Verb probabilities are then upweighted accordingly.

Next, `grammar_probs()` computes the posterior probabilities for each sign:

```{r}
gp <- grammar_probs(sg, prior, dic)
```

For signs with many dictionary entries, the observed frequencies dominate; for rare signs, the result falls back to the prior distribution. The position of a sign in the sequence is currently not taken into account for calculating probabilities.
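This fallback behaviour is what additive (Dirichlet-style) smoothing produces. The sketch below is an assumption about the general idea and not the formula used by `grammar_probs()`; the prior values, the counts, and the weight `k` are all made up.

```{r}
# Smoothed estimate: (counts + k * prior) / (total count + k)
smooth_probs_sketch <- function(counts, prior, k = 1) {
  (counts + k * prior) / (sum(counts) + k)
}

toy_prior <- c(S = 0.6, V = 0.3, A = 0.1)   # made-up prior distribution

# Sign with many entries: the observed frequencies dominate
smooth_probs_sketch(c(S = 90, V = 10, A = 0), toy_prior)

# Sign without entries: the result equals the prior
smooth_probs_sketch(c(S = 0, V = 0, A = 0), toy_prior)
```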

The function `plot_sign_grammar()` presents the results as a stacked bar chart:

```{r, fig.width = 7, fig.height = 4}
plot_sign_grammar(gp, sign_names = TRUE)
```

Each bar represents a sign position in the sentence. The colours represent grammatical types: green for nouns (S), red shades for verbs (V) and verb operators, blue shades for attribute operators, orange for adjective-like operators (S☒→S), and grey shades for all other operators. A large segment in a particular colour indicates that the sign likely has that grammatical function.


The chart can also be saved to a file:

```{r, eval = FALSE}
plot_sign_grammar(gp, output_file = "grammar.png")
```


### 5.3 Grammatical structure of a cuneiform text

Once you have assigned grammatical types to each sign, the function `grammatical_structure()` shows how the parts are grouped according to the operator binding and composition rules. The output uses typed brackets to indicate the role of each group: `()` for substantives (S), `<>` for verbs (V), `[]` for attributes (A), and `{}` for sentences (SEN).

Consider the expression `mec3-ki-aj2-ga-ce-er ce du`, which is entered below as a single hyphenated sequence of eight signs:

```{r}
x <- "mec3-ki-aj2-ga-ce-er-ce-du"
x <- paste0(info(x)$reading, collapse = "-")
x
expr <- split_sumerian(x)$signs
type <- c("S", "S", "Sx->A", "xS->A", "S", "Sx->S", "S", "Sx->V")

grammatical_structure(x, type, expr)
```

The following figure shows the same result with colour coding:

```{r, echo = FALSE, out.width = "100%"}
img_path <- system.file("extdata", "grammatical_structure.png", package = "sumer")
if (file.exists(img_path)) knitr::include_graphics(img_path)
```

The figure shows the typical structure of an Old Sumerian sentence: the subject (mec3) comes first, followed by specifications of the subject (here in square brackets), then the object (ce), and finally the verb (du), which absorbs the object. This example demonstrates that many Sumerian proper names are self-explanatory. The term "mec3-ki-aj2-ga-ce-er" stands for the proper name "Meskiagasher", but can also be read as a noun phrase.

This visualization makes the grammatical structure explicit and can help verify that the type assignments produce a sensible grouping.


## 6. Interactive Translation with `translate()`

The function `translate()` opens an interactive Shiny gadget for translating Sumerian text. To demonstrate, we use a fragment from line 16 of "Enki and the World Order":

```{r, eval = FALSE}
x <- as.cuneiform("cag4-kalam-ma-gi-hal. hal-la-gin7.")
result <- translate(x)
```

This expression contains eight cuneiform signs. Our task is to assign each sign a grammatical type and translation, and then compose them into coherent English sentences.


### 6.1 Recognizing sentence boundaries

The input actually contains two sentences. You must recognize sentence boundaries yourself -- they are not detected automatically. In general, sentence boundaries follow directly after verbs.

A striking feature of this Old Sumerian text is that duplicated signs often mark sentence boundaries: the left occurrence functions as a verb at the end of one sentence, while the right occurrence functions as a noun at the beginning of the next sentence. In our example, the sign HAL (𒄬) appears twice. The first HAL is a verb (Vt: "to split S into separate groups") ending the first sentence, while the second HAL is a noun (S: "separated groups") beginning the second sentence.

The two sentences are:

1. `cag4-kalam-ma-gi-hal` (𒊮𒌧𒈠𒄀𒄬): "The central administration splits the people of Sumer into separate groups."
2. `hal-la-gin7` (𒄬𒆷𒁶): "Places for the separated groups are created."
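Adjacent duplicates are easy to locate programmatically, which gives candidate boundaries to inspect by hand. The base R sketch below compares readings rather than signs, which is a simplification, and its variable names are made up. Whether a duplicate really marks a boundary remains a judgement call.

```{r}
# Syllables of the fragment, in text order
toy_fragment <- c("cag4", "kalam", "ma", "gi", "hal", "hal", "la", "gin7")

# Positions where a reading is immediately followed by the same reading
dup_at <- which(head(toy_fragment, -1) == tail(toy_fragment, -1))
dup_at

# Start a new group after each such position to get candidate sentences
toy_groups <- cumsum(c(TRUE, seq_along(toy_fragment)[-1] %in% (dup_at + 1)))
split(toy_fragment, toy_groups)
```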


### 6.2 Structure of the translate gadget

When `translate()` opens, you see a scrollable page with the following sections. The gadget is described in more detail in the vignette "Translating Sumerian Texts".

- **N-gram patterns** -- Frequent sign combinations from the text that appear in the current line. This section is only available when used together with a full text.
- **Sign combination suggestions** -- Sign combinations from the current line for which a dictionary offers a translation.
- **Context** -- Neighbouring lines with frequent n-grams marked. This section is only available when used together with a full text.
- **Grammar probabilities** -- A bar chart showing the likelihood of each grammatical type for each sign (see Section 5.2).
- **Translation** -- The main interactive section, containing:
    - A **bracket input field** with an "Update Skeleton" button for defining the sentence structure (see below).
    - An **interactive skeleton** with input fields for type and translation. Each entry has a green lookup button and a brown compose button.
    - A **dictionary panel** that displays lookup results when the green button is clicked.


### 6.3 Looking up and adopting dictionary entries

When the gadget opens, each sign is pre-filled with its most frequent translation from the dictionary. These suggestions are not always correct -- they are simply the entries with the highest count.

Consider the sign `gi=GI=𒄀` in our example. The automatic suggestion may show a noun entry (S) if that is the most frequent type for 𒄀 in the dictionary. However, in this context, 𒄀 functions as an adjective operator `S☒→S` meaning "the permanent S".

To correct this:

1. **Click the green arrow button** next to the 𒄀 entry. This opens the dictionary panel below, showing all entries for 𒄀 with their counts and types.
2. **Find the correct entry** -- in this case, `S☒→S`: "the permanent S".
3. **Click the dictionary row** to adopt its type and translation into the skeleton.

If you use multiple dictionaries, the first one has priority for the automatic suggestions. All dictionaries are displayed in the lookup panel, so you can choose from any of them.


### 6.4 Defining structure with brackets

In the bracket input field (next to the "Update Skeleton" button), you can control how the skeleton is structured by inserting brackets:

**Round brackets `( )`** group signs into a compound expression. The skeleton will show an entry for the group in addition to entries for its individual signs. In other words, the brackets tell the tool that these signs form a coherent phrase and add a line to the skeleton where its translation can be entered.

**Angle brackets `< >`** mark a fixed term (typically a proper name). No individual entries are generated for the signs inside. For instance, `<d-en-ki>` would be treated as a single unit "Enki" without breaking it into AN, EN, KI.

**Curly braces `{ }`** mark operator arguments. In most cases this is not necessary, because operators and their arguments are detected automatically. Only when the automatic detection fails -- for instance in ambiguous groupings -- do you need to specify operator arguments explicitly with curly braces.

After editing the brackets, click **"Update Skeleton"** to rebuild the template. All previously entered translations are preserved.
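Before pasting a longer bracket string into the input field, it can help to check that all brackets are balanced. The following base R sketch is one way to do that. It is a hypothetical helper and says nothing about how the gadget itself validates its input.

```{r}
# Hypothetical helper: TRUE if ( ), < > and { } are balanced and properly nested
brackets_balanced_sketch <- function(x) {
  closing <- c("(" = ")", "<" = ">", "{" = "}")
  stack <- character(0)
  for (ch in strsplit(x, "")[[1]]) {
    if (ch %in% names(closing)) {
      stack <- c(stack, closing[[ch]])           # remember the expected closer
    } else if (ch %in% closing) {
      if (length(stack) == 0 || stack[length(stack)] != ch) return(FALSE)
      stack <- stack[-length(stack)]             # matched, pop it
    }
  }
  length(stack) == 0
}

brackets_balanced_sketch("(un-ma-gi)")
brackets_balanced_sketch("<d-en-ki>")
brackets_balanced_sketch("(un-ma-gi")
```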


### 6.5 Composing entries with the compose button

Once you have assigned types and translations to the individual signs of a group, you can click the **brown compose button** (&#x120FB;) next to the parent entry. This automatically combines the children into a composed translation, applying the operator and composition rules described in Section 4.

For example, after filling in the three children of `(un-ma-gi)`:

```{r, echo = FALSE}
ex_df <- data.frame(
  Syllable  = c("un", "ma", "gi"),
  Sign      = c("\U00012327", "\U00012220", "\U00012100"),
  Type      = c("\u2612S\u2192S", "S", "S\u2612\u2192S"),
  Translation = c("community of S", "container", "the permanent S")
)
knitr::kable(ex_df, col.names = c("Syllable", "Cuneiform Sign", "Type", "Translation"))
```

clicking the compose button on the parent entry produces: type **S**, translation **"community of the permanent container"**.

The composed translation often needs manual finishing. In this case, you would edit the translation to add the specific meaning: **"community of the permanent container {people of Sumer}"**. Other common adjustments include adding articles or conjugating verbs. The compose button provides a starting point that you then refine.
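When working with such entries outside the gadget, the literal and the specific meaning can be separated with a regular expression. This is a minimal base R sketch for single, non-nested braces at the end of a translation; `split_meaning_sketch()` is a hypothetical helper.

```{r}
# Hypothetical helper: split "literal meaning {specific meaning}"
split_meaning_sketch <- function(x) {
  brace_re <- "\\{([^}]*)\\}\\s*$"
  has_specific <- grepl(brace_re, x, perl = TRUE)
  list(
    literal  = trimws(sub(brace_re, "", x, perl = TRUE)),
    specific = if (has_specific) {
      sub(paste0("^.*", brace_re), "\\1", x, perl = TRUE)
    } else {
      NA_character_
    }
  )
}

split_meaning_sketch("community of the permanent container {people of Sumer}")
split_meaning_sketch("container")
```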


### 6.6 Result and next steps

When you click **"Done"**, `translate()` returns a `skeleton` object -- a character vector containing the completed translation in pipe format. This can be saved as a text file:

```{r, eval = FALSE}
result <- translate(x)
writeLines(result, "my_translation.txt")
```

The saved file serves as input for building a custom dictionary (see the vignette "Translating Sumerian Texts").

A completed translation for our example looks like this:

```
Structure: (𒊮(𒌧𒈠𒄀)𒄬). (𒄬𒆷𒁶).

|cag4-kalam-ma-gi-hal-hal-la-gin7: SEN: The central administration splits
  the people of Sumer into separate groups. Places for the separated
  groups are created.

|cag4-kalam-ma-gi-hal=ŠA3.UN.MA.GI.HAL: SEN: The central administration
  splits the people of Sumer into separate groups.
|	cag4=ŠA3=𒊮: S: center {the central administration}
|	kalam-ma-gi=UN.MA.GI=𒌧𒈠𒄀: S: community of the permanent container {people of Sumer}
|		kalam=UN=𒌧: ☒S→S: community of S
|		ma=MA=𒈠: S: container
|		gi=GI=𒄀: S☒→S: the permanent S
|	hal=HAL=𒄬: Vt: to split S into separate groups

|hal-la-gin7=HAL.LA.DIM2=𒄬𒆷𒁶: SEN: Places for the separated groups
  are created.
|	hal=HAL=𒄬: S: separated groups
|	la=LA=𒆷: S☒→S: place for S
|	gin7=DIM2=𒁶: V: to be created
```

Each line starting with `|` is a dictionary entry. The indentation reflects the hierarchical structure: the overall sentence at the top, word groups below, and individual signs at the deepest level.
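To make the layout concrete, the sketch below takes one such entry apart in base R. It assumes the pattern visible in the example (leading `|`, tabs for depth, then `key: type: translation` with `=` inside the key) and handles single-line entries only. `parse_entry_sketch()` is a hypothetical helper and not the package's reader.

```{r}
# Hypothetical helper: parse one single-line entry of the pipe format
parse_entry_sketch <- function(line) {
  m <- regmatches(line, regexec("^[|](\t*)([^:]+): ([^:]+): (.*)$", line))[[1]]
  if (length(m) == 0) return(NULL)               # not an entry line
  key <- strsplit(m[3], "=", fixed = TRUE)[[1]]  # reading=SIGN.NAME=cuneiform
  list(
    depth       = nchar(m[2]),                   # number of leading tabs
    reading     = key[1],
    sign_name   = key[2],                        # NA if the key has no "=" part
    cuneiform   = key[3],
    type        = m[4],
    translation = m[5]
  )
}

toy_entry <- parse_entry_sketch("|\t\tma=MA=\U00012220: S: container")
str(toy_entry)
```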

**Learning by example.** The package includes an example project with lines 1--31 of "Enki and the World Order" already translated. You can open any of these lines to study the translations and learn how the type system works in practice:

```{r, eval = FALSE}
path <- system.file("extdata", package = "sumer")

file.copy(
  from = file.path(path, "project"),
  to   = tempdir(),
  recursive = TRUE
)

ctx <- translation_context(
  line_folder   = file.path(tempdir(), "project/lines"),
  text          = file.path(tempdir(), "project/enki_and_the_world_order.txt"),
  dic           = file.path(path, "sumer-dictionary.txt"),
  sentence_prob = 0.25
)

# Open line 16 to see the full translation of our example
translate_line(16, ctx)
```

The second vignette ("Translating Sumerian Texts") describes the complete workflow for translating a document line by line and building a dictionary from the results.