ContextSnake grammatical inference pattern

a Markov chain/generative grammar approach with self-adjusting context depth


concept by Gerhard Nierhaus,  implementation Alberto de Campo.

comments and suggestions welcome! 


Based on a corpus of examples, a ContextSnake creates a stream of values:

For each next value, a number of previous values define the context depth; 

the longer the context considered, the less likely the context is to occur in its full length. 

The snake extends the context until it becomes unique, i.e. until it only occurs 

at a single location in the corpus; then it reduces context depth again so that more 

than one option becomes available, and picks the next item from these options. 

(Like a gourmet, it insists on choices.)


Two parameters influence these choices further: minDepth is the minimum context depth below which single choices are accepted), and acceptSingleCond is a flag or boolean function whether to accept single choices even when above minimum depth.

 

 

  // Demonstrate the algorithm by hand first: 

  // a corpus of two example strings - text sequences are easiest to follow.

a = [ "was it a cat or a car i saw", "warsaw" ]; 

// and a context snake with that corpus.

c = ContextSnake(a);


// find all locations of the sublist given in corpus; 

// returns list of index pairs [which example in corpus, starting where in example]


c.find("cat"); // "cat" is unique, i.e. only found once in corpus: 

// in example 0, starting at index 9 

c.find("ca"); // "ca" occurs twice, so two locations 


// do context-snake decisions by hand:

c.find("w"); // 4 locations, pick first of them as next (here, by hand):

// at [0, 0], next letter is "a"; so the context is now "wa".

c.find("wa"); // still 2 locations, pick first

c.find("was"); // 1 unique location; reduce context - drop the oldest character

c.find("as"); // 1 unique location; reduce context again

c.find("s"); // 3 locations; pick the last one

c.find("sa"); // 2 locations; pick the latter

c.find("saw"); // still 2 locations; but both have no next letter -> end.


// The entire string produced by now is: "wasaw".



*new(corpus, starter, minDepth, acceptSingleCond, starterLength, verbose); 

corpus is the corpus from which to generate new streams,

starter is an optional start sequence (usually the beginning of a corpus specimen),

minDepth is the minimum context depth below which single choices are also accepted,

acceptSingleCond

c = ContextSnake(a, "ca", 1).verbose_(true); // a snake, posting all its decisions.

e = c.asStream; 

e.next; // get a next value; do this several times!

e.next; 

e.next; 


e.nextN(5).join.postcs; // get several values at once, stream may end and only return nils.


c.verbose_(false); // turn off long posting


// collect all next values until stream ends with nil; 

// for strings, .join makes them more readable.

c.asStream.all.join; 


5.do { c.asStream.all.join.postcs };


c = ContextSnake(a, "w", 2); // try longer minimum context

c.asStream.all.join;

// slightly different corpus

a = [ "was it a cat or a car i saw", "there once was a saw in warsaw" ]; 

c = ContextSnake(a, "s", 4); // try longer context again

c.asStream.all.join; // collect all next values until stream ends



Fairytales and poetry can make good material, e.g. we experimented with 

Schneewittchen (Snow white), Erlkoenig, and others, see ContextSnake_3texts.html.

Formulaic segments can become points of 'modulation' between sections. 




A very simple music example:


// melody which can go up only in steps, and down only in thirds;

// can begin with 0, 7, or 6; can end with 7, 4 or 0.

b = [ (0..7), [7, 5, 3, 1, 2, 3, 4], [6, 4, 2, 0]];


c = ContextSnake(b, minDepth: 1, starterLength: 1);


10.do { c.asStream.all.postln };



// add pauses between new starts

Pbind(\degree, Pseq([Plazy({ c.randStarter }), \_, \_], inf), \dur, 0.2).trace(\degree).play;

s.boot;



// Add special markers for the beginning and ending to each phrase; 

// this can be helpful for doing special things on segment borders.

(

b = [ (0..7), [7, 5, 3, 1, 2, 3, 4], [6, 4, 2, 0]];

d = b.collect { |ex| ['<'] ++ ex ++ ['>'] }; // add

c = ContextSnake(d, minDepth: 1, starterLength: 1);

)

( // beginning and ending symbols create pauses 

Pbind(

\degree, Pn(Plazy({ c.randStarter }), inf), 

\dur, 0.2

).trace(\degree).play;

)


For a musical example with Palestrina melodies, see ContextSnake_Palestrina.html.



Analysis 


ContextSnakes can analyse a sample for different criteria, for example, what the longest sequences are that occur verbatim in the corpus. If the found sequences overlap, the sample is a possible production from the given corpus. Other methods test whether the sample is valid output of the snake, and whether it is new or not.   

(

a = [ 

"ContextSnakes can analyse a sample for different criteria, for example, what the longest sequences are that occur verbatim in the corpus.", 

"If the found sequences overlap, the sample is a possible production from the given corpus.", "Other methods test whether the sample is valid output, and whether it is new or not."

];

c = ContextSnake(a, minDepth: 4); 

)

10.do { c.randStarter.asStream.all.join.postcs }; // produce something


b = "If the given corpus."; // a short possible production;


longestSnippets (sample)

c.longestSnippets(b).printcsAll // where the segments are;


isValidOutput(sample, snippets)

tests whether a sample is a possible production of the snake.

if you have the snippets already, you can pass them in for efficiency. 

also posts the amounts of overlap, in effect telling you likely minimum context depth.

c.isValidOutput(b); // yes

c.isValidOutput("longest samples"); // yes

c.isValidOutput("shortest samples"); // zeroes show that no overlap, though all the letters are OK.

c.isValidOutput("context XYZ impossible."); // negative numbers show non-occurring letters.

isNew(sample)

tests whether a sample is a new production of the pattern.

Sometimes the snake may regenerate parts of the corpus, 

which should be identified as not new. 

c.isNew(b); // is b a new production? yes.

c.isNew("samples"); // yes, new.

c.isNew("sample"); // no, occurs as is.

c.isNew(a.first); // line 0 from the corpus is also not new.


vocabulary () 

all the elements occurring in the corpus; can be used for testing. 

c.vocabulary.postcs; 

c.vocabulary.includesAll(b);

c.vocabulary.includesAll("shortest samples");

c.vocabulary.includesAll("context XYZ impossible.");



/****** possible extensions: 

* force switching when segments get too long? (how to check for that?)

* choosing could have tendencies, e.g. prefer to stay close, 

or prefer to switch examples whenever possible. 

******/