An overview of text simplification in SIMPATICO

One of the key assets in the SIMPATICO technology toolkit is the use of Natural Language Processing (NLP) tools to automatically adapt the texts in Public Administration e-services. These texts are often notoriously complex and difficult for citizens to follow, and simplifying them manually is expensive and slow, so our project proposes an automatic approach.

For the NLP tools to achieve the overall goal, the system needs to follow two complementary strategies:

  • Lexical simplification, or the substitution of complex words or concepts with simpler ones that have a similar meaning. For example, the phrase ‘it was a joyous occasion’ could be replaced with ‘it was a happy occasion’.
  • Syntactic simplification, or the replacement of complex sentences and structures with alternatives that are easier for readers to follow. A common example is rewriting passive-voice sentences in the active voice.
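
As a rough illustration of the passive-to-active case, here is a minimal Python sketch. It is not how the SIMPATICO tools work: it assumes a single hypothetical sentence pattern (‘X was/were VERBed by Y.’), a tiny hand-written table of irregular verbs, and invented names throughout (passive_to_active, IRREGULAR_PAST). A real system would rely on a syntactic parser instead of a regular expression.

```python
import re

# Hypothetical table: irregular past participle -> simple past form.
# Regular verbs need no entry because both forms are identical ("approved").
IRREGULAR_PAST = {"written": "wrote", "given": "gave", "taken": "took"}

# Toy pattern for "<patient> was/were <participle> by <agent>."
PASSIVE = re.compile(
    r"^(?P<patient>.+?) (?:was|were) (?P<participle>\w+) by (?P<agent>.+?)\.?$"
)

def lower_determiner(phrase):
    """Lowercase a sentence-initial determiner, leaving proper nouns alone."""
    first, _, rest = phrase.partition(" ")
    if first in {"The", "A", "An"} and rest:
        return f"{first.lower()} {rest}"
    return phrase

def passive_to_active(sentence):
    """Rewrite a simple past-tense passive sentence in the active voice.

    Sentences that do not match the toy pattern are returned unchanged.
    """
    match = PASSIVE.match(sentence.strip())
    if match is None:
        return sentence
    patient = lower_determiner(match.group("patient"))
    agent = match.group("agent")
    participle = match.group("participle")
    verb = IRREGULAR_PAST.get(participle, participle)
    # The agent becomes the subject, so it takes the sentence-initial capital.
    agent = agent[0].upper() + agent[1:]
    return f"{agent} {verb} {patient}."

print(passive_to_active("The request was approved by the office."))
```

The pattern only covers the past tense and breaks on sentences that contain more than one ‘by’ phrase, which is one reason the analysis stages described next are needed in practice.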

To achieve these results, the text in the e-services needs to be analysed by a number of tools. The image below shows a simple example of these analysis stages in action.

The system first needs to recognize the word or words that may be troublesome for the reader, in this case ‘perched’. It then has to understand how the word works: the tense, voice, modality and other grammatical features must be recognized so that they can be applied correctly to the alternative. Next, a synonym generation tool produces a set of candidate words. From these, we filter out the words that are synonyms in general but not in the original context of the difficult word. In our example, ‘alighted’ is a synonym of ‘perched’ in some cases, but it does not fit this context well. The remaining alternatives are then ranked using further criteria, such as selecting the simplest possible word. Finally, the sentence is reconstructed using the correct form of the chosen alternative.
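
The stages above can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the SIMPATICO implementation: the example sentence, the frequency counts, the synonym list, the context table and every name (FREQUENCY, CONTEXT_FIT, simplify and so on) are invented for illustration, and word frequency stands in for both the complexity check and the simplicity ranking.

```python
# All tables below are hypothetical toy data, not real linguistic resources.
FREQUENCY = {"the": 1000, "on": 1000, "cat": 800, "fence": 600,
             "perch": 40, "sit": 900, "rest": 700, "roost": 20, "alight": 10}
ANALYSES = {"perched": ("perch", "past")}            # surface form -> (lemma, tense)
SYNONYMS = {"perch": ["alight", "roost", "sit", "rest"]}
CONTEXT_FIT = {"alight": {"bird"}, "roost": {"bird"},  # words each candidate suits
               "sit": {"cat", "bird", "fence"}, "rest": {"cat", "bird", "fence"}}
IRREGULAR_PAST = {"sit": "sat"}
COMPLEXITY_THRESHOLD = 100

def analyse(token):
    """Stage 2: recover the lemma and tense (None when unknown)."""
    return ANALYSES.get(token.lower(), (token.lower(), None))

def is_complex(lemma):
    """Stage 1: treat rare lemmas as complex; unknown lemmas count as simple."""
    return FREQUENCY.get(lemma, COMPLEXITY_THRESHOLD) < COMPLEXITY_THRESHOLD

def inflect(lemma, tense):
    """Stage 6: put the chosen lemma back into the original tense."""
    if tense == "past":
        return IRREGULAR_PAST.get(lemma, lemma + "ed")
    return lemma

def simplify(sentence):
    tokens = sentence.split()
    context = {token.lower() for token in tokens}
    output = []
    for token in tokens:
        lemma, tense = analyse(token)
        if not is_complex(lemma):
            output.append(token)
            continue
        candidates = SYNONYMS.get(lemma, [])                   # stage 3: generate
        fitting = [c for c in candidates
                   if CONTEXT_FIT.get(c, set()) & context]     # stage 4: filter
        if not fitting:
            output.append(token)                               # nothing suitable
            continue
        best = max(fitting, key=lambda c: FREQUENCY.get(c, 0))  # stage 5: rank
        output.append(inflect(best, tense))
    return " ".join(output)

print(simplify("The cat perched on the fence"))
```

In this sketch ‘alight’ and ‘roost’ are discarded because their toy context sets share no word with the sentence, and ‘sit’ outranks ‘rest’ on frequency, so the result is ‘The cat sat on the fence’. The input is written without punctuation to keep the tokenizer trivial.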

For this process, SIMPATICO uses a wide array of individual tools. Some are established state-of-the-art tools, such as the CoreNLP toolkit from Stanford University, while others are developed in-house by the project partners (e.g., TINT by FBK or LEXenstein by the University of Sheffield). During our work in SIMPATICO, we expect to advance the state of the art in text simplification for the project’s working languages: English, Italian, Spanish and Galician.
