Speech recognition

In this example, we are going to explore a possible usage of streaming process mining in the context of speech recognition: each word a person says can be treated as an activity, and the sentences these words belong to can be the process instances. Every time the speaker pauses for a considerable amount of time between words, we can assume a new sentence is starting and thus a new case should be generated.

To accomplish our goal we need to define a new BeamlineAbstractSource which can listen to the microphone, perform the speech recognition, and generate the corresponding events. For the speech recognition we are going to use the Vosk speech recognition toolkit, together with the vosk-model-small-en-us-0.15 model, which is available on the library website.
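
Before looking at the details, the overall shape of such a source could resemble the sketch below. This is not the complete implementation: it assumes, as for the other Beamline sources, that BeamlineAbstractSource allows overriding run(SourceContext<BEvent>) and exposes an isRunning() flag; all names are illustrative:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

import beamline.events.BEvent;
import beamline.sources.BeamlineAbstractSource;

public class SpeechRecognizerSource extends BeamlineAbstractSource {

   // thread-safe buffer filled by the recognition thread and drained by Flink
   private Queue<BEvent> buffer = new ConcurrentLinkedQueue<>();

   @Override
   public void run(SourceContext<BEvent> ctx) throws Exception {
      startSpeechRecognitionThread(); // sketched in the rest of this section
      while (isRunning()) {
         if (buffer.isEmpty()) {
            Thread.sleep(50); // avoid busy waiting while nothing is said
         } else {
            ctx.collect(buffer.poll());
         }
      }
   }

   private void startSpeechRecognitionThread() {
      // the microphone/Vosk listening loop goes here (see below)
   }
}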

First, we need to set up the Maven dependencies by adding them to the pom.xml file:

<dependency>
   <groupId>com.alphacephei</groupId>
   <artifactId>vosk</artifactId>
   <version>0.3.33</version>
</dependency>
<dependency>
   <groupId>org.json</groupId>
   <artifactId>json</artifactId>
   <version>20211205</version>
</dependency>
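
Once the dependencies are in place, a quick way to verify that the model folder is correctly set up is to try loading it. Here is a minimal sketch, assuming the model archive has been unpacked into the project root:

import org.vosk.Model;

public class VoskModelTest {
   public static void main(String[] args) throws Exception {
      // the path assumes the unpacked vosk-model-small-en-us-0.15 folder
      try (Model model = new Model("vosk-model-small-en-us-0.15")) {
         System.out.println("Model loaded correctly");
      }
   }
}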

Then, once the model folder is properly set, we can configure the source SpeechRecognizerSource. The main idea is to construct a new thread which keeps listening for speech and translates it to a string. This has to be inserted into a (potentially never-ending) loop. Within the loop, we can extract the array of all words said so far with:

// getPartialResult returns a JSON string from which we extract its only field ("partial")
String text = (String) new JSONObject(recognizer.getPartialResult()).get("partial");
if (text.isEmpty()) {
   // if the text is empty we can skip this round
   continue;
}
// split the sentence into the individual words
String[] words = text.split(" ");
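
For completeness, the listening thread wrapping this snippet could be organized as in the following sketch, where isRunning() refers to the flag of the source sketched earlier. It assumes the Java Sound API for microphone access (imports from javax.sound.sampled, plus org.vosk.Model and org.vosk.Recognizer), and all names and parameters are illustrative:

new Thread(() -> {
   try (Model model = new Model("vosk-model-small-en-us-0.15");
        Recognizer recognizer = new Recognizer(model, 16000.0f)) {
      // open the microphone with the sample rate expected by the model
      AudioFormat format = new AudioFormat(16000.0f, 16, 1, true, false);
      TargetDataLine microphone = (TargetDataLine) AudioSystem.getLine(
            new DataLine.Info(TargetDataLine.class, format));
      microphone.open(format);
      microphone.start();

      byte[] chunk = new byte[4096];
      while (isRunning()) { // the (potentially never-ending) loop
         int read = microphone.read(chunk, 0, chunk.length);
         recognizer.acceptWaveForm(chunk, read);
         // ...here the partial result is extracted as shown above...
      }

      microphone.stop();
      microphone.close();
   } catch (Exception e) {
      e.printStackTrace();
   }
}).start();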

Once the newly said words have been identified (code not reported here), it is possible to construct the corresponding event with:

// processing new case ids
if (lastWordMillisecs + MILLISECS_FOR_NEW_CASE < System.currentTimeMillis()) {
   caseId++;
}
lastWordMillisecs = System.currentTimeMillis();

// prepare the actual event
buffer.offer(BEvent.create("speech", "case-" + caseId, word));

Here, buffer is the buffer used to store events before they are dispatched to the other operators, and MILLISECS_FOR_NEW_CASE is a long indicating how many milliseconds of silence must separate two sentences (and hence trigger the creation of a new case identifier).
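
As for identifying the newly said words (whose code is not reported above), one naive possibility is to compare the current partial result with the previous one. The helper below is purely illustrative and ignores the fact that Vosk may revise words it has already emitted:

// hypothetical helper: returns the words appended since the previous partial result
static String[] newWords(String previousText, String currentText) {
   String[] prev = previousText.isEmpty() ? new String[0] : previousText.split(" ");
   String[] curr = currentText.split(" ");
   if (curr.length <= prev.length) {
      return new String[0]; // nothing new was said
   }
   return java.util.Arrays.copyOfRange(curr, prev.length, curr.length);
}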

A simple consumer, in this case the Trivial discovery miner (here configured, via the refresh rate and the minimum dependency, to emit an updated map after every event and to retain all observed dependencies), can then be attached to the source with:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
   .addSource(new SpeechRecognizerSource())
   .keyBy(BEvent::getProcessName)
   .flatMap(new DirectlyFollowsDependencyDiscoveryMiner()
         .setModelRefreshRate(1)
         .setMinDependency(0))
   .addSink(new SinkFunction<ProcessMap>() {
      @Override
      public void invoke(ProcessMap value, Context context) throws Exception {
         System.out.println(value.getProcessedEvents());
         value.generateDot().exportToSvg(new File("src/main/resources/output/output.svg"));
      }
   });
env.execute();

In the following example, I tested the system by saying two sentences:

  • "hello my name is peter"
  • "good morning my name is bob"

The result of the processing is shown below:

[Output figure: the directly-follows map mined from the two sentences, with the activities "hello", "good morning", "my", "name", "is", "peter", and "bob"; "my", "name", and "is" have frequency 2 (shared by both sentences), the others frequency 1.]

The two sentences are recognized properly. It is worth noticing that, in the second sentence, the first two words ("good morning") have been recognized as a single activity, probably because I said them very quickly, one right after the other.

The complete code of this example is available in the GitHub repository https://github.com/beamline/examples/tree/master/src/main/java/beamline/examples/speechRecognition.