sm.compositionality.Rd
Implementation of the Spike-Montague segmentation and measure of additive compositionality (Spike 2016), which finds the most predictive associations between meaning features and substrings. Computation is deterministic and fast.
sm.compositionality(x, y, groups = NULL, strict = FALSE)
sm.segmentation(x, y, strict = FALSE)
x | a list or vector of character sequences specifying the signals to
be analysed. Alternatively, |
---|---|
y | a matrix or data frame with as many rows as there are signals,
indicating the presence/value of the different meaning dimensions along
columns (see section Meaning data format). If |
groups | a list or vector with as many items as strings, used to split
|
strict | logical: if |
sm.segmentation provides detailed information about the most predictably co-occurring segments for every meaning feature. It returns a data frame with one row for every meaning feature, in descending order of the mutual predictability from (and to) their corresponding string segments. The data frame has the following columns:
N
The number of signal-meaning pairings in which this meaning feature was attested.
mp
The highest mutual predictability found between this meaning feature and one (or more) segments.
p
Significance level of the given mutual predictability, i.e. the probability that a mutual predictability at least this high would be reached by chance. The calculation depends on the frequency of the meaning feature as well as the number and relative frequency of all substrings across all signals (see below).
ties
The number of substrings found in strings which have this same level of mutual predictability with the meaning feature.
segments
For strict=FALSE: a list containing the ties substrings in descending order of their length (the ordering is for convenience only and not inherently meaningful). When strict=TRUE, the lists of segments for each meaning feature are all of the same length, and the order of segments is meaningful across rows: every set of segments found in the same position for each of the different meaning features constitutes a valid segmentation in which the segments' occurrences in the actual signals do not overlap.
sm.compositionality calculates the weighted average of the mutual predictability of all meaning features and their most predictably co-occurring strings, as computed by sm.segmentation. The function returns a data frame of three columns: N is the total number of signals (utterances) on which the computation was based, M the number of distinct meaning features attested across all signals, and meanmp the mean mutual predictability across all these features, weighted by the features' relative frequency. When groups is not NULL, the data frame contains one row for every group.
The algorithm works on compositional meanings that can be expressed as sets of categorical meaning features (see below), and does not take the order of elements into account. Rather than looking directly at how complex meanings are expressed, the measure really captures the degree to which a homonymy- and synonymy-free signalling system exists at the level of individual semantic features.
The segmentation algorithm provided by sm.segmentation() scans through all sub-strings found in strings to find the pairings of meaning features and sub-strings whose respective presence is most predictive of each other. Mathematically, for every meaning feature \(f\in M\), it finds the sub-string \(s_{ij}\) from the set of strings \(S\) that yields the highest mutual predictability across all signals,
$$mp(f,S) = \max_{s_{ij}\in S}\ P(f|s_{ij}) \cdot P(s_{ij}|f)\;.$$
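The maximisation above can be illustrated with a brute-force sketch in base R. The helpers `all_substrings()` and `best_mp()` are hypothetical names introduced here for illustration and are not part of the package; the sketch assumes the meaning feature is attested in at least one signal and makes no attempt to be as fast as sm.segmentation().

```r
# Hypothetical helper: all distinct substrings of a single string.
all_substrings <- function(s) {
  n <- nchar(s)
  idx <- expand.grid(start = seq_len(n), end = seq_len(n))
  idx <- idx[idx$start <= idx$end, ]
  unique(substring(s, idx$start, idx$end))
}

# Hypothetical helper: highest mutual predictability P(f|s) * P(s|f)
# between one meaning feature and any substring attested in the signals.
# Assumes sum(feature_present) > 0.
best_mp <- function(signals, feature_present) {
  candidates <- unique(unlist(lapply(signals, all_substrings)))
  mp <- vapply(candidates, function(seg) {
    has_seg <- grepl(seg, signals, fixed = TRUE)
    both <- sum(has_seg & feature_present)
    (both / sum(has_seg)) * (both / sum(feature_present))
  }, numeric(1))
  best <- max(mp)
  # tolerate floating point noise when collecting tied segments
  list(mp = best, segments = names(mp)[abs(mp - best) < 1e-12])
}

signals <- c("as", "bas", "basf")
best_mp(signals, c(TRUE, FALSE, TRUE))   # feature 'a': mp = 2/3
best_mp(signals, c(FALSE, TRUE, TRUE))   # feature 'b': mp = 1
```

For feature 'a' the tied segments are "a", "s" and "as", each of which occurs in all three signals; for feature 'b' they are "b", "ba" and "bas".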
Based on the mutual predictability levels obtained for the individual meaning features, sm.compositionality then computes the mean mutual predictability weighted by the individual features' relative frequencies of attestation, i.e.
$$mp(M,S) = \sum_{f\in M} freq_f \cdot mp(f,S)\;,$$
as a measure of the overall compositionality of the signalling system.
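As a small worked example of the weighted mean, the following sketch assumes that \(freq_f\) is each feature's attestation count divided by the sum of all attestation counts (an assumption about the normalisation, not something the text states explicitly). The counts and mutual predictability values are those of a two-feature system in which both features are attested twice.

```r
# Attestation counts and per-feature mutual predictability (illustrative values)
feature_counts <- c(b = 2, a = 2)
feature_mp     <- c(b = 1, a = 2/3)

# Assumed normalisation: relative frequency = count / total count
freq <- feature_counts / sum(feature_counts)

mean_mp <- sum(freq * feature_mp)
mean_mp
```

With these inputs the weighted mean is 5/6, i.e. roughly 0.833.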
Since mutual predictability is determined separately for every meaning feature, the most predictive sub-strings posited for different meaning features as returned by sm.segmentation() can overlap, and even coincide completely. Such results are generally indicative of either limited data (in particular frequent co-occurrence of the meaning features in question), or spurious results in the absence of a consistent signalling system. The latter will also be indicated by the significance level of the given mutual predictability.
A perfectly unambiguous mapping between a meaning feature and a specific string segment will always yield a mutual predictability of 1. In the absence of such a regular mapping, on the other hand, chance co-occurrences of strings and meanings will in most cases stop the mutual predictability from going all the way down to 0. In order to help distinguish chance co-occurrence levels from significant signal-meaning associations, sm.segmentation() provides significance levels for the mutual predictability levels obtained for each meaning feature.
What is the baseline level of association between a meaning feature and a set of sub-strings that we would expect to be due to chance co-occurrences? This depends on several factors, from the number of data points on which the analysis is based to the frequency of the meaning feature in question and, perhaps most importantly, the overall makeup of the different substrings that are present in the signals. Since every substring attested in the data is a candidate for signalling the presence of a meaning feature, the absolute number of different substrings greatly affects the likelihood of chance signal-meaning associations. (Diversity of the set of substrings is in turn heavily influenced by the size of the underlying alphabet, a factor which is often not appreciated.)
For every candidate substring, the degree of association with a specific meaning feature that we would expect by chance is again dependent on the absolute number of signals in which the substring is attested.
Starting from the simplest case, take a meaning that is featured in \(m\) of the total \(n\) signals (where \(0 < m \leq n\)). Assume next that there is a string segment that is attested in \(s\) of these signals (where again \(0 < s \leq n\)). The degree of association between the meaning feature and string segment is dependent on the number of times that they co-occur, which can be no more than \(c_{max} = min(m, s)\) times. The null probability of getting a given number of co-occurrences can be obtained by considering all possible reshufflings of the meaning feature in question across all signals: if \(s\) signals contain a given substring, how many of \(s\) randomly drawn signals from the pool of \(n\) signals would contain the meaning feature if a total of \(m\) signals in the pool did? Approached from this angle, the likelihood of the number of co-occurrences follows the hypergeometric distribution, with \(c\) being the number of successes when taking \(s\) draws without replacement from a population of size \(n\) with fixed number of successes \(m\).
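The hypergeometric null probabilities described above can be computed with base R's dhyper(). The numbers below are illustrative: a meaning feature attested in \(m = 2\) of \(n = 3\) signals, and a substring attested in \(s = 2\) signals.

```r
n <- 3  # total number of signals
m <- 2  # signals in which the meaning feature is attested
s <- 2  # signals in which the substring is attested

# possible numbers of co-occurrences, from 0 up to c_max = min(m, s)
c_vals <- 0:min(m, s)

# dhyper(x, m, n, k): x successes in k draws from an urn holding
# m successes and n failures (here: n - m signals without the feature)
null_probs <- dhyper(c_vals, m, n - m, s)
setNames(null_probs, c_vals)
```

Here zero co-occurrences are impossible (only one signal lacks the feature, but two are drawn), one co-occurrence has probability 2/3 and two co-occurrences have probability 1/3.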
For every number of co-occurrences \(c \in [0, c_{max}]\), one can compute the corresponding mutual predictability level as \(p(c|s) \cdot p(c|m)\) to obtain the null distribution of mutual predictability levels between a meaning feature and one substring of a particular frequency \(s\), where \(f\) is the probability mass function of the hypergeometric distribution with population size \(n\), \(m\) successes in the population and \(s\) draws: $$Pr(mp = p(c|s) \cdot p(c|m)) = f(k=c; N=n, K=m, n=s)$$
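A minimal sketch of this null distribution for a single substring follows. It reads \(p(c|s)\) as \(c/s\) and \(p(c|m)\) as \(c/m\), which is an interpretation of the notation rather than something the text spells out; `mp_null_distribution()` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: null distribution of mutual predictability between
# a feature attested in m of n signals and one substring attested in s signals.
mp_null_distribution <- function(n, m, s) {
  c_vals <- 0:min(m, s)
  data.frame(
    c    = c_vals,
    mp   = (c_vals / s) * (c_vals / m),    # assumed reading of p(c|s) * p(c|m)
    prob = dhyper(c_vals, m, n - m, s)     # hypergeometric null probability
  )
}

null_dist <- mp_null_distribution(n = 3, m = 2, s = 2)
null_dist
```

Each row pairs one possible mutual predictability level with the probability of observing it when the meaning feature is reshuffled at random across signals.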
From this, we can now derive the null distribution for the entire set of attested substrings as follows: making the simplifying assumption that the occurrences of different substrings are independent of each other, we first average over the null distributions of all the individual substrings to obtain the mean probability \(p\) that one randomly drawn string from the entire population of substrings reaches a mutual predictability level at least as high as \(mp\). Assuming the total number of candidate substrings is \(|S|\), the overall null probability that at least one of them would yield a mutual predictability at least as high is $$Pr(X\ge 1) = 1 - (1-p)^{|S|}, \quad X \sim B(n=|S|, p=p)\;.$$
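The whole procedure can be sketched in a few lines of base R. `mp_null_pvalue()` is a hypothetical helper written for illustration, not the package's implementation; it assumes \(p(c|s) = c/s\) and \(p(c|m) = c/m\), and takes the probability of at least one substring reaching the observed level to be \(1 - (1-p)^{|S|}\).

```r
# Hypothetical helper: chance probability that at least one candidate
# substring reaches a mutual predictability of at least `mp`.
#   mp              observed mutual predictability of the feature
#   m, n            the feature is attested in m of n signals
#   substring_freqs number of signals containing each distinct substring
mp_null_pvalue <- function(mp, m, n, substring_freqs) {
  tail_prob <- function(s) {
    c_vals  <- 0:min(m, s)
    mp_vals <- (c_vals / s) * (c_vals / m)
    probs   <- dhyper(c_vals, m, n - m, s)
    sum(probs[mp_vals >= mp - 1e-9])   # small tolerance for floating point
  }
  p <- mean(vapply(substring_freqs, tail_prob, numeric(1)))
  1 - (1 - p)^length(substring_freqs)
}

# signals "a", "b", "ab": substrings a, b (2 signals each) and ab (1 signal)
p_perfect <- mp_null_pvalue(mp = 1, m = 2, n = 3, substring_freqs = c(2, 2, 1))

# signals "as", "bas", "basf": a, s, as (3 signals each); b, ba, bas (2 each);
# f, sf, asf, basf (1 each)
freqs <- c(3, 3, 3, 2, 2, 2, 1, 1, 1, 1)
p_b <- mp_null_pvalue(mp = 1,   m = 2, n = 3, substring_freqs = freqs)
p_a <- mp_null_pvalue(mp = 2/3, m = 2, n = 3, substring_freqs = freqs)

round(c(p_perfect, p_b, p_a), 3)
```

Under these assumptions the sketch yields 0.529, 0.651 and 0.994, which agrees with the p column reported for the corresponding examples on this page.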
Note that, since the null distribution also depends on the frequency with which the meaning feature is attested, the significance levels corresponding to a given mutual predictability level are not necessarily identical for all meaning features, even within one analysis.
(In theory, one can also compute an overall p-value of the weighted mean mutual predictability as calculated by sm.compositionality. However, the significance levels for the individual meaning features are much more insightful and should therefore be consulted directly.)
The meanings argument can be a matrix or data frame in one of two formats. If it is a matrix of logicals (TRUE/FALSE values), then the columns are assumed to refer to meaning features, with individual cells indicating whether the meaning feature is present or absent in the signal represented by that row (see binaryfeaturematrix() for an explanation). If meanings is a data frame or matrix of any other type, it is assumed that the columns specify different meaning dimensions, with the cell values showing the levels with which the different dimensions can be realised. This dimension-based representation is automatically converted to a feature-based one by calling binaryfeaturematrix(). As a consequence, whatever the actual types of the columns in the meaning matrix, they will be treated as categorical factors for the purpose of this algorithm, also discarding any explicit knowledge of which 'meaning dimension' they might belong to.
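To make the two formats concrete, the sketch below converts a dimension-based matrix into a logical feature matrix. `dims_to_features()` is a hypothetical stand-in written for illustration; it is not binaryfeaturematrix() and may differ from it in column naming, ordering and edge-case handling.

```r
# Hypothetical helper: one logical column per (dimension, level) pair.
dims_to_features <- function(meanings) {
  meanings <- as.data.frame(meanings, stringsAsFactors = FALSE)
  cols <- lapply(names(meanings), function(dim) {
    values <- as.character(meanings[[dim]])   # treat every column as categorical
    lvls <- unique(values)
    m <- outer(values, lvls, "==")            # TRUE where the row has that level
    colnames(m) <- paste(dim, lvls, sep = "=")
    m
  })
  do.call(cbind, cols)
}

dimension_based <- cbind(animal = c(1, 1, 2, 2), colour = c(3, 4, 3, 4))
feature_based <- dims_to_features(dimension_based)
feature_based
```

The result has one row per signal and the four logical columns animal=1, animal=2, colour=3 and colour=4; once converted, nothing in the matrix records that animal=1 and animal=2 belong to the same dimension.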
Spike, M. 2016 Minimal requirements for the cultural evolution of language. PhD thesis, The University of Edinburgh. http://hdl.handle.net/1842/25930.
# perfect communication system for two meaning features (which are marked
# as either present or absent)
sm.compositionality(c("a", "b", "ab"), cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
#>   N M meanmp
#> 1 3 2      1
sm.segmentation(c("a", "b", "ab"), cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
#>   N mp     p ties segments
#> a 2  1 0.529    1        a
#> b 2  1 0.529    1        b
#>
#> Mean feature-wise mutual predictability, weighted by feature frequency: 1

# not quite perfect communication system
sm.compositionality(c("as", "bas", "basf"), cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
#>   N M    meanmp
#> 1 3 2 0.8333333
sm.segmentation(c("as", "bas", "basf"), cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
#>   N    mp     p ties   segments
#> b 2 1.000 0.651    3 bas, ba, b
#> a 2 0.667 0.994    3   as, a, s
#>
#> Mean feature-wise mutual predictability, weighted by feature frequency: 0.833

# same communication system, but force candidate segments to be non-overlapping
# via the 'strict' option
sm.segmentation(c("as", "bas", "basf"), cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)), strict=TRUE)
#>
#>   N    mp     p ties segments
#> b 2 1.000 0.651    2    s, ba
#> a 2 0.667 0.994    2    as, b
#>
#> Mean feature-wise mutual predictability, weighted by feature frequency: 0.833

# the function also accepts meaning-dimension based matrix definitions:
print(twobytwoanimals <- enumerate.meaningcombinations(c(animal=2, colour=2)))
#>      animal colour
#> [1,]      1      3
#> [2,]      1      4
#> [3,]      2      3
#> [4,]      2      4

# note how there are many more candidate segments than just the full length
# ones. the less data we have, the more likely it is that shorter substrings
# will be just as predictable as the full segments that contain them.
sm.segmentation(c("greendog", "bluedog", "greencat", "bluecat"), twobytwoanimals)
#>          N mp     p ties     segments
#> animal=1 2  1 0.996    5 dog, do,....
#> animal=2 2  1 0.996    6 cat, ca,....
#> colour=3 2  1 0.996   12 green, g....
#> colour=4 2  1 0.996    9 blue, bl....
#>
#> Mean feature-wise mutual predictability, weighted by feature frequency: 1

# perform the same analysis, but using the formula interface
print(twobytwosignalingsystem <- cbind(twobytwoanimals, signal=c("greendog", "bluedog", "greencat", "bluecat")))
#>      animal colour signal
#> [1,] "1"    "3"    "greendog"
#> [2,] "1"    "4"    "bluedog"
#> [3,] "2"    "3"    "greencat"
#> [4,] "2"    "4"    "bluecat"
sm.segmentation(signal ~ colour + animal, twobytwosignalingsystem)
#>          N mp     p ties     segments
#> colour=3 2  1 0.996   12 green, g....
#> colour=4 2  1 0.996    9 blue, bl....
#> animal=1 2  1 0.996    5 dog, do,....
#> animal=2 2  1 0.996    6 cat, ca,....
#>
#> Mean feature-wise mutual predictability, weighted by feature frequency: 1

# since there is no overlap in the constituent characters of the identified
# 'morphemes', they are all tied in their mutual predictiveness with the
# (shorter) substrings they contain
#
# to reduce the pool of candidate segments to those which are
# non-overlapping and of maximal length, again use the 'strict=TRUE' option:
sm.segmentation(signal ~ colour + animal, twobytwosignalingsystem, strict=TRUE)
#>
#>          N mp     p ties segments
#> colour=3 2  1 0.996    1    green
#> colour=4 2  1 0.996    1     blue
#> animal=1 2  1 0.996    1      dog
#> animal=2 2  1 0.996    1      cat
#>
#> Mean feature-wise mutual predictability, weighted by feature frequency: 1