---
title: "***zoolog***:  \n Zooarchaeological Analysis with Log-Ratios"
author: "Jose M Pozo, Angela Trentacoste,  Ariadna Nieto-Espinet, Silvia Guimarães Chiarelli, and Silvia Valenzuela-Lamas"
email: "josmpozo@gmail.com, svalenzuela@imf.csic.es"
date: "`r format(Sys.Date())`"
bibliography: ../inst/REFERENCES.bib
## to create the vignettes 'outside' the package, with table of content (toc)
#output:
#  html_document:
#    toc: true
#    toc_float:
#      collapsed: false
#      smooth_scroll: false
## to create the vignettes 'inside' the package, without table of content (toc)
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{zoolog}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

The package *zoolog* includes functions and reference data to generate and 
manipulate log-ratios (also known as log size index (LSI) values) from measurements obtained on zooarchaeological material. Log ratios are used to compare the relative (rather than the absolute) dimensions of animals from archaeological contexts [@meadow1999use]. Essentially, the method compares archaeological measurements to a standard, producing a value that indicates how much larger or smaller the archaeological specimen is compared to that standard. *zoolog* is also able to seamlessly integrate data and references with heterogeneous nomenclature, which is internally managed by a zoolog thesaurus.

The methods included in the package were first developed in the framework of the ERC-Starting Grant 716298 [ZooMWest](https://zoomwest11.wixsite.com/zoomwest) 
(PI S. Valenzuela-Lamas), and were first used 
in the paper [@trentacoste2018pre]. They are based on the techniques proposed by @simpson1941large and @simpson1960quantitative, which calculates *log size index* (LSI) values as: 
$$
\mbox{LSI} = \log_{10} x - \log_{10} x_{\text{ref}} = \log_{10}(x/x_\text{ref}),
$$
where $x$ is the considered measure value and $x_\text{ref}$ is the corresponding reference value. Observe that LSI is defined using logarithms with base 10.

Several different sets of standard reference values are included in the package. These standards include several published and widely used biometric datasets (e.g. @davis1996measurements, @albarella2005neolithic) as well as other less known standards. These references, as well as the data example provided with the package, are based on the measures and measure abbreviations defined in @von1976guide and @davis1992rapid. 

In general, zooarchaeological datasets are composed of skeletal remains representing many different anatomical body parts. In investigation of animal size, the analysis of measurements from a given anatomical element provides the best control for the variables affecting size and shape and, as such, it is the preferable option. Unfortunately, this approach is not always viable due to low sample sizes in some archaeological assemblages. This problem can be mitigated by calculating the LSI values for measurements with respect to a reference, which provides a means of aggregating biometric information from different body parts. The resulting log ratios can be compared and statistically analysed under reasonable conditions [@albarella2002size]. However, length, width, and depth measurements of the anatomical elements still should not be directly aggregated for statistical analysis. 

The package includes a *zoolog thesaurus* to facilitate its usage by research teams across the globe, and working in different languages and with different recording traditions. The thesaurus enables the *zoolog* package to recognises many different names for taxa and skeletal elements (e.g. "Bos taurus", "cattle", "BT",  "bovino", "bota"). Consequently, there is no need to use a particular, standardised recording code for the names of different taxa or elements.

The package also includes a *zoolog taxonomy* to facilitate the management of cases recorded as identified at different taxonomic ranks (Species, Genus, Tribe) and their match with the corresponding references.

## Acknowledgements
We are particularly grateful to Sabine Deschler-Erb and Barbara Stopp, from the University of Basel (Switzerland) for making the reference values of several specimens available through the ICAZ Roman Period Working Group, which have been included here with their permission. We also thank Francesca Slim and Dimitris Filioglou from the University of Groningen, Claudia Minniti from University of Salento, Sierra Harding and Nimrod Marom from the University of Haifa, Carly Ameen and Helene Benkert from the University of Exeter, and Mikolaj Lisowski from the University of York for providing additional reference sets. Allowen Evin (CNRS-ISEM Montpellier) saw potential pitfalls in the use of Davis' references for sheep, which have been now solved.

The thesaurus has benefited from the contributions from Moussab Albesso, Canan Çakirlar, Jwana Chahoud, Jacopo De Grossi Mazzorin, Sabine Deschler-Erb, Dimitrios Filioglou, Armelle Gardeisen, Sierra Harding, Pilar Iborra, Michael MacKinnon, Nimrod Marom, Claudia Minniti, Francesca Slim, Barbara Stopp, and Emmanuelle Vila.

We are grateful to all of them for their contributions, comments, and help. In addition, users are encouraged to contribute to the thesaurus and other references so that *zoolog* can be expanded and adapted to any database.


# Installation

You can install the released version of zoolog from [CRAN](https://CRAN.R-project.org/package=zoolog) with:

``` r
install.packages("zoolog")
```

And the development version from [GitHub](https://github.com/) with:

``` r
install.packages("devtools")
devtools::install_github("josempozo/zoolog@HEAD", build_vignettes = TRUE)
```


# Reference standards {#sec:refStandards}

The package *zoolog* includes several osteometrical references. Currently, the references include reference values for the main domesticates and their agriotypes (*Bos*, *Ovis*, *Capra*, *Sus*), red deer (*Cervus elaphus*), Persian fallow deer (*Dama mesopotamica*), mountain gazelle (*Gazella gazella*), donkey (*Equus asinus*), horse (*Equus caballus*), European rabbit (*Oryctolagus cuniculus*), and grey wolf (*Canis lupus*). These are drawn from a variety of publications and resources (see below). In addition, the user can consider other references, or the provided references can be extended and updated integrating newer research data. Submission of extended/improved references is encouraged. Please, contact the maintainer through the provided email address to make the new reference fully accessible within the package.

These references, as well as the data example provided with the package, are based on the measurements and measure abbreviations defined in @von1976guide and @davis1992rapid. Please, note that Davis’ standard SD, equivalent to Von den Driesch’s DD, has been denoted as Davis.SD in order to resolve its incompatibility with Von den Driesch’s SD. This affects only Davis’ sheep reference and was introduced in Release 1.0.0.

The predefined reference sets included in *zoolog* are provided in the named list `reference`, currently comprising the following `r length(zoolog::reference)` sets:
```{r}
library(zoolog)
str(reference, max.level = 1)
```
The reference set `reference$Combi` is the default reference for computing the [log ratios](../help/LogRatios), since it includes the most comprehensive reference for each species.

The package also includes a `referencesDatabase` collecting the taxon-specific
reference standards from all the considered resources. Each reference set is
composed of a different combination of taxon-specific standards selected from
this `referencesDatabase`. The selection is defined by the data frame 
`referenceSets`:
 
```{r echo=FALSE}
options(knitr.kable.NA = "")
knitr::kable(referenceSets)
```

A detailed description of the reference data, including its structure, properties, and considered resources can be found in the [ReferencesDatabase](../help/referencesDatabase) help page.

# Thesaurus

A [thesaurus set](../help/zoologThesaurus) is defined in order to make the package compatible with the different recording conventions and languages used by authors of zooarchaeological datasets. This enables the function [LogRatios](../help/LogRatios) to match values in the user's dataset with the corresponding ones in the reference standard, regardless of differences in nomenclature or naming conventions, as long as both terms are included in the relevant thesaurus. The thesaurus also allows the user to [standardize the nomenclature](../help/StandardizeNomenclature) of the dataset if desired. 

The user can also use other thesaurus sets or modify the provided one. In this latter case, we encourage the user to contact the maintainer at the provided email address so that the additions can be incorporated into the new versions of the package.

Currently, the zoolog thesaurus set includes four thesauri:

identifierThesaurus
  : For the column names that identify the variables used in computing
    the log ratios. It includes the categories *Taxon*, *Element*, *Measure*, 
    and *Standard*. Each category provides a series of equivalent names. 
    For instance, *Taxon* includes the options 
    `r a <- zoologThesaurus$identifier$Taxon; 
      paste0("*", paste(a[a != ""], collapse = "*, *"), "*")`. 
    This thesaurus is case, accent, and punctuation insensitive, so that, for
    instance, "Especie" is equivalent to 
    "ESPECIE" or "Espècie".

taxonThesaurus
  : For the names of the different taxa when recording animal
    bones. The current categories are 
    *`r paste(names(zoologThesaurus$taxon), collapse = "*, *")`*,
    each with different equivalent names. This thesaurus is case, accent,
    and punctuation insensitive, so that, for instance, "Bos" is equivalent to 
    "bos", "Bos.", or "Bos\ ".

elementThesaurus
  : Names of anatomical elements when recording animal bones. It currently
    includes `r ncol(zoologThesaurus$element)` categories
    (*`r paste(names(zoologThesaurus$element)[1:3], collapse = "*, *")`*, ...),
    each with different equivalent names. 
    This thesaurus is case, accent, and punctuation insensitive.

measureThesaurus
  : Names of the measurements. While the English abbreviations from
    @von1976guide and @davis1992rapid are widely used in published
    literature, this thesaurus enables other nomenclatures (e.g. original German
    abbreviations in @von1976guide) to be included. This thesaurus is case
    sensitive.

# Taxonomy

A [taxonomy](../help/zoologTaxonomy) for the most typical zooarchaeological taxa has been introduced in the *zoolog* major release 1.0.0. The taxonomic hierarchy is structured up to the family level:
``` {r, echo=FALSE}
knitr::kable(zoologTaxonomy)
```
This taxonomy is intended to facilitate the management of cases recorded as identified at different taxonomic ranks and their match with the corresponding references.

The taxonomy is integrated in the function [LogRatios](../help/LogRatios), enabling it to automatically match different species in data and reference that are under the same genus. For instance, data of *Bos taurus* can be matched with reference of *Bos primigenius*, since both are *Bos*. It also enables the function `LogRatios` to detect when a case taxon has been only partially identified and recorded at a higher taxonomic rank, such as tribe or family, and to suggest the user the set of possible reference species.

Besides, it is complemented with a series of functions enabling the user to query for the [subtaxonomy](../help/Subtaxonomy) or the set of species under a queried taxon at any taxonomic rank. 

# Functions
The full list of functions is available under the zoolog *help* page. We list them here sorted by their prominence for a typical user, and grouped by functionality:

[**LogRatios**](../help/LogRatios)
  : It computes the log ratios of the measurements in a dataset relative to 
    standard reference values. By default `reference$Combi` is used. The 
    function includes the option 'joinCategories' allowing several taxa 
    (typically *Ovis*, *Capra*, and unknown *Ovis/Capra*) to be considered
    together with the same reference taxon.
  : Note that without using 'joinCategories' any taxa not part of the selected
    reference set will be excluded. For instance, if using
    `reference$NietoDavisAlbarella`, log ratios for goats will not be calculated 
    unless 'joinCategories' is set to indicate that the *Ovis aries* standard
    should also be applied to goats.

[**CondenseLogs**](../help/CondenseLogs)
  : It condenses the calculated log ratio values into a reduced number of 
    features by grouping several measure log ratios and selecting or calculating 
    a representative feature value. By default the selected groups represent a
    single dimension, i.e. *Length*, *Width*, and *Depth*. Only one feature is 
    extracted per group. Currently, two methods are possible: "priority" 
    (default) or "average". 
  : This operation is motivated by two circumstances. First, not all
    measurements are available for every bone specimen, which obstructs their
    direct comparison and statistical analysis. Second, several measurements 
    can be strongly correlated (e.g. SD and Bd both represent bone width). 
    Thus, considering them as independent would produce an over-representation 
    of bone remains with multiple measurements per axis. Condensing each group 
    of measurements into a single feature (e.g. one measure per axis) alleviates
    both problems. 
  : The default method ("priority"), selects the first available log ratio in 
    each group. Besides, `CondenseLogs` employs the following by-default group
    and prioritization introduced in @trentacoste2018pre: 
    *Length* considers in order of priority *GL*, *GLl*, *GLm*, and *HTC*. 
    *Width* considers in order of priority  *BT*, *Bd*, *Bp*, *SD*, *Bfd*, 
    and *Bfp*. 
    *Depth* considers in order of priority  *Dd*, *DD*, *BG*, and *Dp*. 
    This order maximises the robustness and
    reliability of the measurements, as priority is given to the most abundant,
    more replicable, and less age dependent measurements. But users can set 
    their own features with any group of measures and priorities. 
    The method "average" extracts the mean per group, ignoring the 
    non-available log ratios.

[**RemoveNACases**](../help/RemoveNACases)
  : It removes the cases (table rows) for which all measurements of interest are
    non-available (NA). A particular list of measurement names can be explicitly
    provided or selected by a common initial pattern (e.g. prefix). The default
    setting removes the rows with no available log ratios to facilitate 
    subsequent analysis of the data.

[**InCategory**](../help/InCategory)
  : It checks if an element belongs to a category according to a thesaurus. 
    It is similar to [base::is.element](../help/is.element), returning a logical
    vector indicating if each element in a given vector is included in a given
    set. 
    But `InCategory` checks for equality assuming the equivalencies defined in 
    the given thesaurus. It is intended for the user to easily select a subset
    of data without having to standardize the analysed dataset.

[**Nomenclature standardization**](../help/StandardizeNomenclature)
  : This includes two functions enabling the user to map data with heterogeneous
    nomenclature into a standard one as defined in a thesaurus:

    * `StandardizeNomenclature` standardizes a character vector according to 
      a given thesaurus.
    * `StandardizeDataSet` standardizes column names and values of a data 
      frame according to a thesaurus set.

[**Subtaxonomy**](../help/Subtaxonomy)
  : This includes three functions enabling the user to query for some 
  information on the subtaxonomy under a queried taxon at any taxonomic 
  rank:

    * `Subtaxonomy` provides the subtaxonomy dataframe collecting the 
    species (rows) included in the queried taxon, and the taxonomic ranks
    (columns) up to its level.
    * `SubtaxonomySet` provides the set (character vector without repetions) 
    of taxa, at any taxonomic rank, under 
    the queried taxon.
    * `GetSpeciesIn` provides the set of species included in the queried taxon.

  : By default, the subtaxonomy information is extracted from the *zoolog* 
  taxonomy.

[**AssembleReference**](../help/AssembleReference)
  : It allows the user to build new references assembling the desired
    taxon-specific references included in the *zoolog* `referencesDataSet` or
    in any other provided by the user.

[**Thesaurus readers and writers**](../help/ThesaurusReaderWriter)
  : This includes functions to read and write a single thesaurus 
    (`ReadThesaurus` and `WriteThesaurus`) and a thesaurus set
    (`ReadThesaurusSet` and `WriteThesaurusSet`).
    
[**Thesaurus management**](../help/ThesaurusManagement)
  : This includes functions to modify and check thesauri:
  
    * `NewThesaurus` generates an empty thesarus.
    * `AddToThesaurus` adds new names and categories to an existing thesaurus.
    * `RemoveRepeatedNames` cleans a thesarus from any repeated names on any 
      category.
    * `ThesaurusAmbiguity` checks if there are names included in more than one
      category in a thesaurus.


# Examples
The following examples are designed to be read and run sequentially. They 
represent a possible pipeline, meaningful for the processing and analysis of a
dataset. Only occasionally, a small diversion is included to illustrate some alternatives.

## Reading data and calculating log ratios
This example reads a dataset from a file in csv format and computes the
log-ratios. Then, the cases with no available log-ratios are removed.
Finally, the resulting dataset is saved in a file in csv format.

The first step is to set the local path to the folder where you have the dataset to be analysed (this is typically a comma-separated value (csv) file). Here the [example dataset](../help/dataValenzuelaLamas2008) from @valenzuela2008alimentacio included in the package is used:

```{r, echo = FALSE}
options(stringsAsFactors = FALSE)
```
```{r}
library(zoolog)
dataFile <- system.file("extdata", "dataValenzuelaLamas2008.csv.gz", 
                        package = "zoolog")
data = read.csv2(dataFile,
                 quote = "\"", header = TRUE, na.strings = "",
                 encoding = "UTF-8")
knitr::kable(head(data)[, -c(6:20,32:64)])
```
To enhance the visibility, we have shown only the most relevant columns.

We now calculate the log-ratios using the function `LogRatios`. Only measurements that have an associated standard will be included in this calculation. The log values will appear as new columns with the prefix 'log' following the original columns with the raw measurements:
```{r}
dataWithLog <- LogRatios(data)
knitr::kable(head(dataWithLog)[, -c(6:20,32:64)])
```

### A note on the warnings of LogRatios
Observe that `LogRatios` has output two warnings above. The first one informs the user that the data includes some cases of recorded taxa that did not match any species in the reference, but was matched by genus. In many cases this can be the behaviour expected by the user. For instance, using the reference of *Sus scrofa* for computing the log ratios of cases of *Sus domesticus*, can be acceptable if no better reference is available or if both (sub)species are being compared. The warning also includes cases recorded by genus, a recording practice that is understandable when working only with one species in the genus, such as *Oryctolagus* or *Ovis* above. However, the warning also informs the user that this matching by genus can be suppressed by setting the parameter `useGenusIfUnambiguous = FALSE`. This can be the case if the user wants to exclude from the analysis the cases with not matching species. Besides, this warning could make the user realize that the wrong reference was set by mistake.

The second warning informs the user that the data includes some cases recorded with a taxon not specifying a species or genus, but at a higher taxonomical rank (for instance undecided *ovis/capra* (equivalent to tribe *Caprini*)). It also informs of the cases recorded by the genus but for which the reference includes more than one species (as happens with *Equus* above). In addition, the user is remembered of the option to use the parameter `joinCategories` to indicate which reference species should be used for them.

These relationships between taxonomical categories is possible thanks to the taxonomy included in the package. The user can include their own taxonomy if desired.

In the following examples, these warnings are suppressed unless some particular message is of interest for them.

## Dealing with *lazy* datasets
If we observe the example dataset more carefully, we can see that the measures
recorded for the *astragali* presents a deviation from the measure definitions 
in @von1976guide and @davis1992rapid. To see this, 
we can select the cases where the element is an astragalus. The function [InCategory](../help/InCategory) allows us to
select them with the help of the thesaurus without requiring to know the terms actually used:
```{r}
AScases <- InCategory(dataWithLog$Os, "astragalus", zoologThesaurus$element)
knitr::kable(head(dataWithLog[AScases, -c(6:20,32:64)]))
```
For the involved taxa, according to the measure definitions, astragali should have no *GL* measurement, but *GLl*. However, in the example dataset the *GLl* measurements have been recorded merged in the *GL* column. This is a data-entry simplification that is used by some researchers. It is possible because *GLl* is only relevant for the astragalus, while *GL* is not applicable to it. Thus, there cannot be any ambiguity between both measures since they can be identified by the bone element. However, since the zoolog reference uses the proper measure name for each bone element (*GLl* for the astragalus), the reference measure has not been correctly identified. Consequently, the log ratio *logGL* has `NA` values and the column *logGLl* does not exists.

The same effect happens for the measure *GLpe*, only relevant for the phalanges. 

The optional parameter mergedMeasures facilitates the processing of this type of simplified datasets. For the example data, we can use
```{r, warning = FALSE}
GLVariants <- list(c("GL", "GLl", "GLpe"))
dataWithLog <- LogRatios(data, mergedMeasures = GLVariants)
knitr::kable(head(dataWithLog[AScases, -c(6:20,32:64)]))
```
This option allows us to automatically select, for each bone element, the corresponding measure present in the reference. Observe that now the log ratios have been computed and assigned to the column *logGL*.

## Using the same ovis reference for all caprines
We could be interested in obtaining the log ratios of all caprines, including *Ovis aries*, *Capra hircus*, and undetermined *Ovis/Capra*, with respect to the reference for *Ovis aries*. This can be set using the argument `joinCategories`.
```{r, warning = FALSE}
caprineCategory <- list(ovar = SubtaxonomySet("caprine"))
dataWithLog <- LogRatios(data, joinCategories = caprineCategory, mergedMeasures = GLVariants)
knitr::kable(head(dataWithLog)[, -c(6:20,32:64)])
```
The category to join can be manually defined, but here we have conveniently used the function `Subtaxonomy` applied to the tribe *Caprini*:
```{r}
SubtaxonomySet("caprine")
```

Note that this option does not remove the distinction in the data between the different species, it just indicates that for these taxa the log ratios must be computed from the same reference ("ovar").

## Pruning the data from cases with no available measure
The cases without log-ratios can be removed to facilitate subsequent analyses:
```{r}
dataWithLogPruned=RemoveNACases(dataWithLog)
knitr::kable(head(dataWithLogPruned[, -c(6:20,32:64)]))
```

You may want to write the resulting file in the working directory (you need to set it first):
```{r, eval = FALSE, warning = FALSE}
write.csv2(dataWithLogPruned, "myDataWithLogValues.csv", 
           quote=FALSE, row.names=FALSE, na="", 
           fileEncoding="UTF-8")
```

## Condensing log values
After calculating log ratios using the `LogRatios` function, many rows in the resultant dataframe (dataWithLog in the example above) may contain multiple log values, i.e. you will have several log values associated with a particular archaeological specimen. When analysing log ratios, it is preferential to avoid over-representation of bones with a greater number of measurements and account for each specimen only once. The `CondenseLogs` function extracts one length, one width, and one depth value from each row and places these in new Length, Width, and Depth columns. The 'priority' method described in @trentacoste2018pre has been set as default. In this case, the default option has been used:
```{r}
dataWithSummary <- CondenseLogs(dataWithLogPruned)
knitr::kable(head(dataWithSummary)[, -c(6:20,32:64,72:86)])
```
Nevertheless, other options (e.g. average of all width log values for a given specimen) can be chosen if preferred. 

## Standardizing the dataset nomenclature
The integration of the thesaurus functionality facilitates the use of datasets with heterogeneous nomenclatures, without further preprocessing. An extensive catalogue of names for equivalent categories has been integrated in the provided thesaurus set `zoologThesaurus`. These equivalences are internally and silently managed without requiring any action from the user. However, it can be also interesting to explicitly standardize the data to make figures legible to a wider audience. This is especially useful when different nomenclature for the same concept is found in the same dataset, for instance "sheep" and "ovis" for the same taxon or "hum" and "HU" for the bone element.

If we standardize the studied data, we can see that `zoologThesaurus` will change "ovar" to "Ovis aries", "hum" to "humerus", and "Especie" to "Taxon", for instance.
```{r}
dataStandardized <- StandardizeDataSet(dataWithSummary)
knitr::kable(head(dataStandardized)[, -c(6:20,32:64,72:86)])
```

## Selecting only caprines
We may be interested in selecting all caprine elements. This can be done even without standardizing the data using the function `InCategory`:
```{r}
dataOC <- subset(dataWithSummary, InCategory(Especie, 
                                             SubtaxonomySet("caprine"),
                                             zoologThesaurus$taxon))
knitr::kable(head(dataOC)[, -c(6:20,32:64)])
```
Observe that no standardization is performed in the output subset. To standardize the subset data, `StandardizeDataSet` can be applied either before or after the subsetting.
```{r}
dataOCStandardized <- StandardizeDataSet(dataOC)
knitr::kable(head(dataOCStandardized)[, -c(6:20,32:64)])
```
Observe also that the distinction between *Ovis aries*, *Capra hircus*, and *Ovis/Capra* has not been removed from the data.

If we were interested only in one summary measure, *Width* or *Length*, we could  retain the cases including this measure:
```{r}
dataOCWithWidth <- RemoveNACases(dataOCStandardized, measureNames = "Width")
dataOCWithLength <- RemoveNACases(dataOCStandardized, measureNames = "Length")
```
which gives respectively `r nrow(dataOCWithWidth)` (`=nrow(dataOCWithWidth)`)  and `r nrow(dataOCWithLength)` (`=nrow(dataOCWithLength)`)  cases.

## Different plots for data visualisation
Condensed log values can be visualised as histograms and box plots using ggplot [@wickham2011ggplot2]. Here we will look at some examples of plotting values from caprines.

### Horizontal Boxplot with dots grouped by site

For the example plots we will use the package **ggplot2**.
```{r, echo = FALSE}
library(ggplot2)
```

We can now create a boxplot for the widths: 
```{r, fig.asp = 0.6, fig.width = 6, fig.align="center"}
ggplot(dataOCStandardized, aes(x = Site, y = Width)) +
  geom_boxplot(outlier.shape = NA, na.rm = TRUE) +
  geom_jitter(width = 0.2, height = 0, alpha = 1/2, color = 4, na.rm = TRUE) +
  theme_bw() +
  ggtitle("Caprine widths") +
  ylab("Width log-ratio") +
  coord_flip()
```

And another boxplot for the lengths:
```{r, fig.asp = 0.6, fig.width = 6, fig.align="center"}
ggplot(dataOCStandardized, aes(x = Site, y = Length)) +
  geom_boxplot(outlier.shape = NA, na.rm = TRUE) +
  geom_jitter(width = 0.2, height = 0, alpha = 1/2, color = 4, na.rm = TRUE) +
  theme_bw() +
  ggtitle("Caprine lengths") +
  ylab("Length log-ratio") +
  coord_flip()
```

### Histograms grouped by site
We may choose to plot the width data as a histogram: 
```{r, fig.asp = 0.7, fig.width = 6, fig.align="center"}
ggplot(dataOCStandardized, aes(Width)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  ggtitle("Caprine widths") +
  xlab("Width log-ratio") +
  facet_grid(Site ~.) +
  theme_bw() +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 10),
        axis.text = element_text(size = 10) ) +
  scale_y_continuous(breaks = c(0, 10, 20, 30))
```

### Vertical boxplot with dots grouped by taxon and site
Here we reorder the factor levels of `dataOCStandardized$Taxon` to make
the order of the boxplots more intuitive.

```{r}
levels0 <- unique(dataOCStandardized$Taxon)
levels0
```
```{r}
dataOCStandardized$Taxon <- factor(dataOCStandardized$Taxon, 
                                   levels = levels0[c(1,3,2)])
levels(dataOCStandardized$Taxon)
```
and assign specific colours for each category:
```{r, message = FALSE, fig.asp = 0.6, fig.width = 6, fig.align="center"}
Ocolour <- c("#A2A475", "#D8B70A", "#81A88D")
ggplot(dataOCStandardized, aes(x=Site, y=Width)) + 
  geom_boxplot(aes(fill=Taxon), 
               notch = TRUE, alpha = 0, lwd = 0.377, outlier.alpha = 0,
               width = 0.5, na.rm = TRUE,
               position = position_dodge(0.75),
               show.legend = FALSE) + 
  geom_point(aes(colour = Taxon, shape = Taxon), 
             alpha = 0.7, size = 0.8, 
             position = position_jitterdodge(jitter.width = 0.3),
             na.rm = TRUE) +
  scale_colour_manual(values=Ocolour) +
  scale_shape_manual(values=c(15, 18, 16)) +
  theme_bw(base_size = 8) +
  ylab("LSI value") +
  ggtitle("Sheep/goat LSI width values") 
```


### Histograms grouped by taxon and site  

```{r, warning = FALSE, fig.asp = 0.6, fig.width = 6, fig.align="center"}
TaxonSiteWidthHist <- ggplot(dataOCStandardized, aes(Width, fill = Taxon)) + 
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") + 
  ggtitle("Sheep/goat Widths") + facet_grid(Site ~ Taxon)
TaxonSiteWidthHist
```
```{r, warning = FALSE, fig.asp = 0.6, fig.width = 6, fig.align="center"}
TaxonSiteWidthHist <- ggplot(dataOCStandardized, aes(Width, fill = Taxon)) + 
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") + 
  ggtitle("Sheep/goat Widths") + facet_grid(~Site)
TaxonSiteWidthHist
```


## Statistical test
We may run a statistical test (here a Student t-test) to check whether the differences in log ratio length values between the sites "OLD" and "ALP" are statistically significant:
```{r}
t.test(Length ~ Site, dataOCStandardized, 
       subset = Site %in% c("OLD", "ALP"),
       na.action = "na.omit")
```

Similarly, for the differences in log ratio width values:
```{r}
t.test(Width ~ Site, dataOCStandardized, 
       subset = Site %in% c("OLD", "ALP"),
       na.action = "na.omit")
```

For testing all possible pairs of sites, the p-values must be adjusted for multiple comparisons:
```{r, message = FALSE}
library(stats)
pairwise.t.test(dataOCStandardized$Width, dataOCStandardized$Site, 
                pool.sd = FALSE)
```


# How to cite the package *zoolog*

We have invested a lot of time and effort in creating zoolog. Please cite the package if you publish an analysis or results obtained using *zoolog*.
For example, "Log ratios calculation and analysis was done using the package zoolog  [@pozo2022zoolog] in R 4.0.3 [@RCoreTeam2020]."

To get the details of the most recent version of the package, you can use the R citation function: `citation("zoolog")`.

When publishing it is also recommended that the references for standards are properly cited, as well as any details on selecting feature values. The details on the source of each of the reference standards are given in the help page [referencesDatabase](../help/referencesDatabase). For instance, if using the zoolog `reference$NietoDavisAlbarella` and the default selection method of features, a fair description may be: "Published references for cattle [@nieto2018element], sheep/goat [@davis1996measurements] and pigs [@albarella2005neolithic] were used as standards. One length and one width log ratio value from each specimen were included in the analysis, with values selected following the default zoolog 'priority' method [@trentacoste2018pre]: length values - GL, GLl, GLm, HTC; width values - BT, Bd, Bp, SD, Bfd, Bfp."

# Bibliography