Find a segmentation that maximises the overall string coverage across all signals.

This algorithm builds on Spike's measure of compositionality (see sm.compositionality). Instead of simply determining which segment(s) have the highest mutual predictability for each meaning feature separately, it attempts to find a combination of non-overlapping segments for each feature that maximises the overall string coverage over all signals. In other words, it tries to find a segmentation that accounts for (or 'explains') as much of the string material in the signals as possible.

Usage

ssm.compositionality(x, y, groups = NULL)

ssm.segmentation(x, y, mergefeatures = FALSE, verbose = FALSE)

Arguments

x	a list or vector of character sequences
y	a matrix or data frame with as many rows as there are strings (see section Meaning data format)
groups	a list or vector with as many items as strings, used to split the signals and meanings into data sets for which the compositionality measures are computed separately.
mergefeatures	logical: if `TRUE`, `ssm.segmentation` will try to improve on the initial solution by incrementally merging pairs of meaning features as long as doing so improves the overall string coverage of the segmentation.
verbose	logical: if `TRUE`, emits messages with detailed information about the number of segment combinations considered for each coverage computation.
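
The kind of partitioning that `groups` describes can be sketched in base R. This is only an illustration of splitting signals and meanings in parallel, not the package's own code; the data and the name `by.group` are hypothetical.

```r
# Hypothetical data: five signals produced by two participants.
x <- c("as", "bas", "basf", "ko", "lo")
y <- cbind(a = c(TRUE, FALSE, TRUE, TRUE, FALSE),
           b = c(FALSE, TRUE, TRUE, FALSE, TRUE))
groups <- c("A", "A", "A", "B", "B")

# One list element per group, each holding that group's signals and
# the matching rows of the meaning matrix.
by.group <- lapply(split(seq_along(x), groups), function(rows)
  list(x = x[rows], y = y[rows, , drop = FALSE]))

str(by.group)
```

Each element of `by.group` is then a self-contained data set on which a measure could be computed separately.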

Details

For large data sets and long strings, this computation can become very slow. If the attested signals do not admit a perfect segmentation, the algorithm is not guaranteed to find any segmentation, as none might exist.

Examples

ssm.segmentation(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
#> Checking 9 segment combinations for overlaps...
#> Initial segmentation covers 6 of 9 characters, mean mp 0.833
#>   N matches matchrate    mp     p segments
#> b 2       2         1 1.000 0.651        b
#> a 2       2         1 0.667 0.994       as
#> 
#> Mean feature-wise mutual predictability, weighted by feature frequency: 0.833 
#> Mean signal-wise character coverage, weighted by features per signal: 0.708 
#> 
#> Segmentation is based on 3 signals totalling 9 characters.
#> Discounting overlaps, the segmentation above accounts for 6 of those characters.
#> Total character coverage rate: 0.667 
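
The coverage figures in the output above can be reproduced by hand. The following base R sketch is not the package's implementation: it assumes that a segment only counts towards coverage in signals whose meaning contains the corresponding feature, and that the signal-wise mean is weighted by the number of features per signal, as the output labels suggest. The helper `covered.chars` is a hypothetical name.

```r
signals <- c("as", "bas", "basf")
meanings <- cbind(a = c(TRUE, FALSE, TRUE), b = c(FALSE, TRUE, TRUE))
# The segmentation reported above: one segment per meaning feature.
segments <- c(a = "as", b = "b")

# Number of characters in `signal` covered by the segments of the
# features in `present`, counting overlapping characters only once.
covered.chars <- function(signal, present) {
  covered <- logical(nchar(signal))
  for (seg in segments[present]) {
    start <- as.integer(regexpr(seg, signal, fixed = TRUE))
    if (start > 0)
      covered[seq(start, length.out = nchar(seg))] <- TRUE
  }
  sum(covered)
}

covered <- sapply(seq_along(signals), function(i)
  covered.chars(signals[i], colnames(meanings)[meanings[i, ]]))
total <- sum(nchar(signals))

# Total character coverage: 6 of 9 characters, rate 0.667
cat(sum(covered), "of", total, "characters, rate",
  round(sum(covered) / total, 3), "\n")

# Signal-wise coverage, weighted by features per signal: 0.708
signalwise <- weighted.mean(covered / nchar(signals), rowSums(meanings))
cat(round(signalwise, 3), "\n")
```

The "as" in "bas" is not counted because that signal's meaning lacks feature a, which is why only 6 of the 9 characters are accounted for.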


# signaling system where one meaning distinction is not encoded in the signals
print(threebytwoanimals <- enumerate.meaningcombinations(list(animal=c("dog", "cat", "tiger"),
  colour=c("col1", "col2"))))
#>      animal  colour
#> [1,] "dog"   "col1"
#> [2,] "dog"   "col2"
#> [3,] "cat"   "col1"
#> [4,] "cat"   "col2"
#> [5,] "tiger" "col1"
#> [6,] "tiger" "col2"

ssm.segmentation(c("greendog", "bluedog", "greenfeline", "bluefeline", "greenfeline", "bluefeline"),
  threebytwoanimals)
#> Checking 1 segment combinations for overlaps...
#> Initial segmentation covers 57 of 57 characters, mean mp 0.833
#>              N matches matchrate  mp     p segments
#> animal=dog   2       2         1 1.0 0.982      dog
#> colour=col1  3       3         1 1.0 0.615    green
#> colour=col2  3       3         1 1.0 0.615     blue
#> animal=cat   2       2         1 0.5     1   feline
#> animal=tiger 2       2         1 0.5     1   feline
#> 
#> Mean feature-wise mutual predictability, weighted by feature frequency: 0.833 
#> Mean signal-wise character coverage, weighted by features per signal: 1 
#> 
#> Segmentation is based on 6 signals totalling 57 characters.
#> Discounting overlaps, the segmentation above accounts for 57 of those characters.
#> Total character coverage rate: 1 

# the same analysis again, but allow merging of features
ssm.segmentation(c("greendog", "bluedog", "greenfeline", "bluefeline", "greenfeline", "bluefeline"),
  threebytwoanimals, mergefeatures=TRUE)
#> Checking 1 segment combinations for overlaps...
#> Initial segmentation covers 57 of 57 characters, mean mp 0.833
#> Merging animal=cat|animal=tiger improved coverage to 57 out of 57, mean mp 1
#>                         N matches matchrate mp     p segments
#> animal=dog              2       2         1  1 0.982      dog
#> colour=col1             3       3         1  1 0.615    green
#> colour=col2             3       3         1  1 0.615     blue
#> animal=cat|animal=tiger 4       4         1  1 0.701   feline
#> 
#> Mean feature-wise mutual predictability, weighted by feature frequency: 1 
#> Mean signal-wise character coverage, weighted by features per signal: 1 
#> 
#> Segmentation is based on 6 signals totalling 57 characters.
#> Discounting overlaps, the segmentation above accounts for 57 of those characters.
#> Total character coverage rate: 1
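
Why merging the two features raises the mutual predictability of "feline" can be checked with a few lines of base R. This sketch assumes that mp is the product of P(segment | feature) and P(feature | segment), which is consistent with the mp values printed above but is not taken from the package source; the function name `mp` and the vector `animal` are hypothetical.

```r
signals <- c("greendog", "bluedog", "greenfeline", "bluefeline",
  "greenfeline", "bluefeline")
# Animal feature of each signal, in the row order of threebytwoanimals.
animal <- c("dog", "dog", "cat", "cat", "tiger", "tiger")

# Assumed definition: P(segment | feature) * P(feature | segment).
mp <- function(has.feature, segment) {
  has.segment <- grepl(segment, signals, fixed = TRUE)
  both <- sum(has.feature & has.segment)
  (both / sum(has.feature)) * (both / sum(has.segment))
}

mp(animal == "cat", "feline")                # 0.5
mp(animal == "tiger", "feline")              # 0.5
mp(animal %in% c("cat", "tiger"), "feline")  # 1
```

Taken separately, each feature explains only half of the occurrences of "feline"; the merged feature covers all four, so the segment and the feature predict each other perfectly.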
