Spike's segmentation and measure of additive compositionality.

Implementation of the Spike-Montague segmentation and measure of additive compositionality (Spike 2016), which finds the most predictive associations between meaning features and substrings. Computation is deterministic and fast.

sm.compositionality(x, y, groups = NULL, strict = FALSE)

sm.segmentation(x, y, strict = FALSE)

Arguments

x	a list or vector of character sequences specifying the signals to be analysed. Alternatively, `x` can also be a formula of the format `s ~ m1 + m2 + ...`, where `s` and `m1`, `m2`, etc. specify the column names of the signals and meaning features found in the data frame that is passed as the second argument.
y	a matrix or data frame with as many rows as there are signals, indicating the presence/value of the different meaning dimensions along columns (see section Meaning data format). If `x` is a formula, the `y` data frame can contain any number of columns, but only the ones whose column name is specified in the formula will be considered.
groups	a list or vector with as many items as there are signals, used to split `x` and `y` into data sets for which compositionality measures are computed separately.
strict	logical: if `TRUE`, perform additional filtering of candidate segments. In particular, it removes combinations of segments (across meanings) which overlap in at least one of the strings where they co-occur. For convenience, it also removes segments which are shorter substrings of longer candidates (for the same meaning feature).

Value

sm.segmentation provides detailed information about the most predictably co-occurring segments for every meaning feature. It returns a data frame with one row for every meaning feature, in descending order of the mutual predictability from (and to) their corresponding string segments. The data frame has the following columns:

N: The number of signal-meaning pairings in which this meaning feature was attested.
mp: The highest mutual predictability found between this meaning feature and any single segment (several segments can be tied at this level).
p: Significance levels of the given mutual predictability, i.e. the probability that the given mutual predictability level could be reached by chance. The calculation depends on the frequency of the meaning feature as well as the number and relative frequency of all substrings across all signals (see below).
ties: The number of substrings found in strings which have this same level of mutual predictability with the meaning feature.
segments: For strict=FALSE: a list containing the tied substrings in descending order of their length (the ordering is for convenience only and not inherently meaningful). When strict=TRUE, the lists of segments for each meaning feature are all of the same length, and the order of segments is meaningful across rows: the segments found at the same position for each of the different meaning features together constitute a valid segmentation, i.e. one in which the segments' occurrences in the actual signals do not overlap.

sm.compositionality calculates the weighted average of the mutual predictability of all meaning features and their most predictably co-occurring strings, as computed by sm.segmentation. The function returns a data frame of three columns: N is the total number of signals (utterances) on which the computation was based, M the number of distinct meaning features attested across all signals, and meanmp the mean mutual predictability across all these features, weighted by the features' relative frequency. When groups is not NULL, the data frame contains one row for every group.

Details

The algorithm works on compositional meanings that can be expressed as sets of categorical meaning features (see below), and does not take the order of elements into account. Rather than looking directly at how complex meanings are expressed, the measure really captures the degree to which a homonymy- and synonymy-free signalling system exists at the level of individual semantic features.

The segmentation algorithm provided by sm.segmentation() scans through all sub-strings found in strings to find the pairings of meaning features and sub-strings whose respective presence is most predictive of each other. Mathematically, for every meaning feature $f\in M$, it finds the sub-string $s_{ij}$ from the set of strings $S$ that yields the highest mutual predictability across all signals, $$mp(f,S) = \max_{s_{ij}\in S}\ P(f|s_{ij}) \cdot P(s_{ij}|f)\;.$$
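The search described above can be sketched in a few lines of base R. This is an illustrative brute-force version, not the package's implementation: the helper names `all_substrings()` and `mutual_predictability()` are made up for this sketch, and the conditional probabilities are estimated as simple relative frequencies over the signals.

```r
# Hypothetical helper: all distinct contiguous substrings of one non-empty string.
all_substrings <- function(s) {
  n <- nchar(s)
  idx <- expand.grid(start = seq_len(n), end = seq_len(n))
  idx <- idx[idx$start <= idx$end, ]
  unique(substring(s, idx$start, idx$end))
}

# Hypothetical helper: highest mutual predictability between one meaning
# feature and any substring. 'feature' is a logical vector that is TRUE for
# every signal in which the meaning feature is present.
mutual_predictability <- function(signals, feature) {
  per_signal <- lapply(signals, all_substrings)
  candidates <- unique(unlist(per_signal))
  mp <- sapply(candidates, function(seg) {
    has_seg <- sapply(per_signal, function(subs) seg %in% subs)
    both <- sum(has_seg & feature)
    (both / sum(has_seg)) * (both / sum(feature))  # P(f|s) * P(s|f)
  })
  best <- max(mp)
  list(mp = best, segments = names(mp)[mp == best])
}

signals <- c("as", "bas", "basf")
mutual_predictability(signals, c(FALSE, TRUE, TRUE))  # mp = 1, tied: b, ba, bas
mutual_predictability(signals, c(TRUE, FALSE, TRUE))  # mp = 2/3, tied: a, as, s
```

The sketch enumerates every substring of every signal, so it is only meant for small toy data sets like the ones in the Examples section.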

Based on the mutual predictability levels obtained for the individual meaning features, sm.compositionality then computes the mean mutual predictability weighted by the individual features' relative frequencies of attestation, i.e. $$mp(M,S) = \sum_{f\in M} freq_f \cdot mp(f,S)\;,$$ as a measure of the overall compositionality of the signalling system.
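As a small worked illustration of the weighted mean, the following base R snippet takes the per-feature values reported by sm.segmentation() for the signals "as", "bas", "basf" in the Examples section. It assumes that $freq_f$ is each feature's attestation count divided by the sum of counts over all features, which is what `weighted.mean()` computes when given the raw counts as weights.

```r
# Per-feature results for the signals "as", "bas", "basf" (see Examples):
# both features are attested in 2 signals, with mp = 1 (b) and mp = 2/3 (a).
N  <- c(b = 2, a = 2)
mp <- c(b = 1, a = 2 / 3)

# weighted.mean() normalises the weights, so raw counts can be passed directly
meanmp <- weighted.mean(mp, w = N)
meanmp  # 0.8333333
```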

Since mutual predictability is determined separately for every meaning feature, the most predictive sub-strings posited for different meaning features as returned by sm.segmentation() can overlap, and even coincide completely. Such results are generally indicative of either limited data (in particular frequent co-occurrence of the meaning features in question), or spurious results in the absence of a consistent signalling system. The latter will also be indicated by the significance level of the given mutual predictability.

Null distribution and p-value calculation

A perfectly unambiguous mapping between a meaning feature and a specific string segment will always yield a mutual predictability of 1. In the absence of such a regular mapping, on the other hand, chance co-occurrences of strings and meanings will in most cases stop the mutual predictability from going all the way down to 0. In order to help distinguish chance co-occurrence levels from significant signal-meaning associations, sm.segmentation() provides significance levels for the mutual predictability levels obtained for each meaning feature.

What is the baseline level of association between a meaning feature and a set of sub-strings that we would expect to be due to chance co-occurrences? This depends on several factors, from the number of data points on which the analysis is based to the frequency of the meaning feature in question and, perhaps most importantly, the overall makeup of the different substrings that are present in the signals. Since every substring attested in the data is a candidate for signalling the presence of a meaning feature, the absolute number of different substrings greatly affects the likelihood of chance signal-meaning associations. (Diversity of the set of substrings is in turn heavily influenced by the size of the underlying alphabet, a factor which is often not appreciated.)

For every candidate substring, the degree of association with a specific meaning feature that we would expect by chance is again dependent on the absolute number of signals in which the substring is attested.

Starting from the simplest case, take a meaning that is featured in $m$ of the total $n$ signals (where $0 < m \leq n$). Assume next that there is a string segment that is attested in $s$ of the $n$ signals (where again $0 < s \leq n$). The degree of association between the meaning feature and string segment is dependent on the number of times that they co-occur, which can be no more than $c_{max} = \min(m, s)$ times. The null probability of getting a given number of co-occurrences can be obtained by considering all possible reshufflings of the meaning feature in question across all signals: if $s$ signals contain a given substring, how many of $s$ randomly drawn signals from the pool of $n$ signals would contain the meaning feature if a total of $m$ signals in the pool did? Approached from this angle, the likelihood of the number of co-occurrences follows the hypergeometric distribution, with $c$ being the number of successes when taking $s$ draws without replacement from a population of size $n$ with fixed number of successes $m$.
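The reshuffling argument can be checked by brute force on a toy case. The sketch below uses made-up values ($n = 5$, $m = 2$, $s = 3$), enumerates every way of assigning the meaning feature to $m$ of the $n$ signals, and compares the resulting distribution of co-occurrence counts with base R's `dhyper()`.

```r
n <- 5  # total number of signals
m <- 2  # signals containing the meaning feature
s <- 3  # signals containing the substring

has_segment <- seq_len(n) <= s  # say the substring occurs in signals 1..s
placements  <- combn(n, m)      # every possible reshuffling of the feature

# number of co-occurrences under each reshuffling
cooc <- apply(placements, 2, function(idx) sum(has_segment[idx]))

c_vals     <- 0:min(m, s)
enumerated <- as.vector(table(factor(cooc, levels = c_vals))) / ncol(placements)

# dhyper(x, m, n, k): x successes in k draws from an urn holding m successes
# and n failures (note that R's 'n' is the number of failures)
hypergeom <- dhyper(c_vals, m, n - m, s)

rbind(enumerated, hypergeom)  # both rows: 0.1 0.6 0.3
```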

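For every number of co-occurrences $c \in [0, c_{max}]$, one can compute the corresponding mutual predictability level as $p(c|s) \cdot p(c|m) = \frac{c}{s} \cdot \frac{c}{m}$ to obtain the null distribution of mutual predictability levels between a meaning feature and one substring of a particular frequency $s$: $$Pr(mp = p(c|s) \cdot p(c|m)) = f(k=c; N=n, K=m, n=s)\;,$$ where $f$ is the probability mass function of the hypergeometric distribution, with population size $N$, number of successes in the population $K$, and the number of draws (the function's own parameter $n$) set to $s$.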
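A minimal sketch of this null distribution for a single substring, in base R. The function name `mp_null()` is made up for illustration, and the sketch assumes that the mutual predictability for $c$ co-occurrences is $(c/s) \cdot (c/m)$, i.e. the two conditional probabilities estimated as relative frequencies.

```r
# Hypothetical helper: null distribution of mutual predictability between a
# meaning feature attested in m of n signals and a substring attested in s.
mp_null <- function(n, m, s) {
  c_vals <- 0:min(m, s)
  data.frame(
    c    = c_vals,                          # number of co-occurrences
    mp   = (c_vals / s) * (c_vals / m),     # resulting mutual predictability
    prob = dhyper(c_vals, m, n - m, s)      # hypergeometric null probability
  )
}

# three signals, feature in two of them, substring in two of them
null_dist <- mp_null(n = 3, m = 2, s = 2)
null_dist
# c = 0: mp = 0,    prob = 0
# c = 1: mp = 0.25, prob = 2/3
# c = 2: mp = 1,    prob = 1/3
```

Co-occurrence counts that are impossible for the given frequencies (here $c = 0$) simply receive probability 0.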

From this, we can now derive the null distribution for the entire set of attested substrings as follows: making the simplifying assumption that the occurrences of different substrings are independent of each other, we first aggregate over the null distributions of all the individual substrings to obtain the mean probability $p=Pr(X\ge mp)$ of finding a mutual predictability level at least as high as $mp$ for one randomly drawn string from the entire population of substrings. Assuming the total number of candidate substrings is $|S|$, the overall null probability that at least one of them would yield a mutual predictability at least as high is $$Pr(X\ge 1) = 1 - (1-p)^{|S|}, \quad X \sim B(n=|S|, p=p)\;.$$
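The aggregation step can be sketched as follows in base R. The helper names `tail_prob()` and `mp_pvalue()` are made up for this illustration and are not part of the package; the sketch takes the probability of at least one of $|S|$ independent candidates reaching the observed level to be $1 - (1-p)^{|S|}$. Applied to the signals "as", "bas", "basf" from the Examples section, it gives the same rounded p-values as shown there.

```r
# Hypothetical helper: Pr(mp >= observed) for one substring attested in s of
# n signals, given a meaning feature attested in m signals.
tail_prob <- function(observed, n, m, s) {
  c_vals <- 0:min(m, s)
  mp <- (c_vals / s) * (c_vals / m)
  # small tolerance guards against floating point ties
  sum(dhyper(c_vals, m, n - m, s)[mp >= observed - 1e-12])
}

# Hypothetical helper: probability that at least one candidate substring
# reaches the observed level, treating substrings as independent.
mp_pvalue <- function(observed, n, m, substring_freqs) {
  p <- mean(sapply(substring_freqs, function(s) tail_prob(observed, n, m, s)))
  1 - (1 - p)^length(substring_freqs)
}

# "as", "bas", "basf" contain ten distinct substrings:
# a, as, s (3 signals each), b, ba, bas (2 each), f, sf, asf, basf (1 each)
freqs <- c(3, 3, 3, 2, 2, 2, 1, 1, 1, 1)

round(mp_pvalue(1,     n = 3, m = 2, freqs), 3)  # feature b: 0.651
round(mp_pvalue(2 / 3, n = 3, m = 2, freqs), 3)  # feature a: 0.994
```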

Note that, since the null distribution also depends on the frequency with which the meaning feature is attested, the significance levels corresponding to a given mutual predictability level are not necessarily identical for all meaning features, even within one analysis.

(In theory, one can also compute an overall p-value of the weighted mean mutual predictability as calculated by sm.compositionality. However, the significance levels for the individual meaning features are much more insightful and should therefore be consulted directly.)

Meaning data format

The meanings argument `y` can be a matrix or data frame in one of two formats. If it is a matrix of logicals (TRUE/FALSE values), then the columns are assumed to refer to meaning features, with individual cells indicating whether the meaning feature is present or absent in the signal represented by that row (see binaryfeaturematrix() for an explanation). If `y` is a data frame or matrix of any other type, it is assumed that the columns specify different meaning dimensions, with the cell values showing the levels with which the different dimensions can be realised. This dimension-based representation is automatically converted to a feature-based one by calling binaryfeaturematrix(). As a consequence, whatever the actual types of the columns in the meaning matrix, they will be treated as categorical factors for the purpose of this algorithm, also discarding any explicit knowledge of which 'meaning dimension' they might belong to.
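To illustrate the difference between the two formats, here is a rough base R sketch of such a conversion. The function `to_feature_matrix()` is a made-up stand-in, not the package's binaryfeaturematrix(), whose exact behaviour and output may differ; the `dimension=value` column naming merely mirrors the row labels seen in the Examples section.

```r
# Hypothetical stand-in for a dimension-to-feature conversion: one logical
# column per (dimension, value) combination attested in the data.
to_feature_matrix <- function(meanings) {
  meanings <- as.matrix(meanings)
  features <- lapply(seq_len(ncol(meanings)), function(j) {
    values <- sort(unique(meanings[, j]))
    m <- outer(meanings[, j], values, "==")  # TRUE where the row has this value
    colnames(m) <- paste0(colnames(meanings)[j], "=", values)
    m
  })
  do.call(cbind, features)
}

# dimension-based representation: two dimensions with two values each
meanings <- cbind(animal = c(1, 1, 2, 2), colour = c(3, 4, 3, 4))

# feature-based representation: four logical columns
# (animal=1, animal=2, colour=3, colour=4)
fm <- to_feature_matrix(meanings)
fm
```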

References

Spike, M. (2016). Minimal requirements for the cultural evolution of language. PhD thesis, The University of Edinburgh. http://hdl.handle.net/1842/25930

Examples

# perfect communication system for two meaning features (which are marked
# as either present or absent)
sm.compositionality(c("a", "b", "ab"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
#>   N M meanmp
#> 1 3 2      1
sm.segmentation(c("a", "b", "ab"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
#>   N mp     p ties segments
#> a 2  1 0.529    1        a
#> b 2  1 0.529    1        b
#> 
#> Mean feature-wise mutual predictability, weighted by feature frequency: 1 

# not quite perfect communication system
sm.compositionality(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
#>   N M    meanmp
#> 1 3 2 0.8333333
sm.segmentation(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
#>   N    mp     p ties   segments
#> b 2 1.000 0.651    3 bas, ba, b
#> a 2 0.667 0.994    3   as, a, s
#> 
#> Mean feature-wise mutual predictability, weighted by feature frequency: 0.833 

# same communication system, but force candidate segments to be non-overlapping
# via the 'strict' option
sm.segmentation(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)), strict=TRUE)
#> Applying strict selection, checking 9 segment combinations for overlap
#>   N    mp     p ties segments
#> b 2 1.000 0.651    2    s, ba
#> a 2 0.667 0.994    2    as, b
#> 
#> Mean feature-wise mutual predictability, weighted by feature frequency: 0.833 


# the function also accepts meaning-dimension based matrix definitions:
print(twobytwoanimals <- enumerate.meaningcombinations(c(animal=2, colour=2)))
#>      animal colour
#> [1,]      1      3
#> [2,]      1      4
#> [3,]      2      3
#> [4,]      2      4

# note how there are many more candidate segments than just the full length
# ones. the less data we have, the more likely it is that shorter substrings
# will be just as predictable as the full segments that contain them.
sm.segmentation(c("greendog", "bluedog", "greencat", "bluecat"), twobytwoanimals)
#>          N mp     p ties     segments
#> animal=1 2  1 0.996    5 dog, do,....
#> animal=2 2  1 0.996    6 cat, ca,....
#> colour=3 2  1 0.996   12 green, g....
#> colour=4 2  1 0.996    9 blue, bl....
#> 
#> Mean feature-wise mutual predictability, weighted by feature frequency: 1 

# perform the same analysis, but using the formula interface
print(twobytwosignalingsystem <- cbind(twobytwoanimals,
  signal=c("greendog", "bluedog", "greencat", "bluecat")))
#>      animal colour signal    
#> [1,] "1"    "3"    "greendog"
#> [2,] "1"    "4"    "bluedog" 
#> [3,] "2"    "3"    "greencat"
#> [4,] "2"    "4"    "bluecat" 

sm.segmentation(signal ~ colour + animal, twobytwosignalingsystem)
#>          N mp     p ties     segments
#> colour=3 2  1 0.996   12 green, g....
#> colour=4 2  1 0.996    9 blue, bl....
#> animal=1 2  1 0.996    5 dog, do,....
#> animal=2 2  1 0.996    6 cat, ca,....
#> 
#> Mean feature-wise mutual predictability, weighted by feature frequency: 1 

# since there is no overlap in the constituent characters of the identified
# 'morphemes', they are all tied in their mutual predictiveness with the
# (shorter) substrings they contain
#
# to reduce the pool of candidate segments to those which are
# non-overlapping and of maximal length, again use the 'strict=TRUE' option:

sm.segmentation(signal ~ colour + animal, twobytwosignalingsystem, strict=TRUE)
#> Applying strict selection, checking 3240 segment combinations for overlap
#>          N mp     p ties segments
#> colour=3 2  1 0.996    1    green
#> colour=4 2  1 0.996    1     blue
#> animal=1 2  1 0.996    1      dog
#> animal=2 2  1 0.996    1      cat
#> 
#> Mean feature-wise mutual predictability, weighted by feature frequency: 1