Project
AI4KNOWLEDGE

Semantic analysis of texts

Objective

The goal of this solution is to build a tool, based on artificial intelligence techniques, that can:

  • Extract text, tables, images and other elements from newspaper pages, scientific publications, manuals, data sheets, etc.;
  • Convert images containing text into actual text through OCR;
  • Subject the resulting texts to semantic analysis, with the dual purpose of indexing the contents and reconstructing the text in a web-friendly form;
  • Provide a question answering system that automatically answers questions asked in natural language, drawing content from the knowledge base created in the previous step.

Pipeline

Image Processing

The image undergoes a series of transformations that serve to identify the regions of interest.
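As an illustration only, the sketch below shows one common way to locate regions of interest: binarize the page, then group dark pixels into connected components and report their bounding boxes. The source does not state which transformations the product applies, so the threshold value, the 4-connectivity rule and the function name `find_regions` are assumptions made for this example.

```python
from collections import deque


def find_regions(gray, threshold=128):
    """Return bounding boxes (top, left, bottom, right) of dark connected regions.

    `gray` is a list of rows of grayscale values (0 = black, 255 = white).
    The threshold and 4-connectivity are assumptions for this sketch.
    """
    rows, cols = len(gray), len(gray[0])
    # Binarize: True marks "ink" (pixels darker than the threshold).
    ink = [[pixel < threshold for pixel in row] for row in gray]
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not ink[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected ink pixels.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and ink[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Tiny synthetic "page": two dark blobs on a white background.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255, 255],
    [255,  15,  30, 255, 255,  40],
    [255, 255, 255, 255, 255,  50],
]
regions = find_regions(page)
print(regions)  # [(1, 1, 2, 2), (2, 5, 3, 5)]
```

Each bounding box can then be cropped and passed on to the next stage of the pipeline.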

Text extraction & OCR

Tesseract is an open source OCR engine. The version used in AIVision is based on LSTM neural networks and can recognize 33 languages.
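A minimal sketch of how a cropped region could be passed to Tesseract from Python through its command-line interface. It assumes the `tesseract` executable and the Italian language data are installed. The file name `region_01.png` and the helper names are hypothetical, and the source does not say how the product actually invokes the engine.

```python
import subprocess


def tesseract_command(image_path, lang="ita"):
    """Build the Tesseract command line for one image.

    "stdout" as the output base makes Tesseract print the recognized text,
    "-l" selects the language and "--oem 1" selects the LSTM engine.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--oem", "1"]


def run_ocr(image_path, lang="ita"):
    """Run Tesseract on an image and return the recognized text.

    Requires the tesseract binary on PATH (an assumption of this sketch).
    """
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical file name, used only to show the resulting command.
command = tesseract_command("region_01.png")
print(" ".join(command))
```

Calling `run_ocr("region_01.png")` would return the raw text, which may still contain recognition errors like those in the sample under "Text validation".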

Text validation

I] Magazzino cooperativo é un albero magnifico, i cul rami s’allargano e si rinnovano ogni di pil; 6 uno splendido fuoco che riscalda e riverbera la sua luce dappertutto. Ben a ragione gli operai di Rochdale assunsero il nome di Probi Pionieri; il pioniere é intrepido americano che apre i primi solchi nelle vergini foreste, e questi Pionieri di Rochdale hanno schiuso alle elassi lavoratrici la via dell’avvenire.

Luzzatti

The raw OCR output is validated in the following steps:

  • the text is recomposed on a single line, without carriage returns;
  • non-alphanumeric characters and punctuation are removed;
  • strings shorter than two characters are removed;
  • stop words are removed, reducing the text to a keyword list;
  • each keyword is validated against a dictionary of approximately 1 million words.

Invalid words are replaced by the closest dictionary word based on measurable criteria.
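The validation steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's implementation: the stop-word set and the ten-word dictionary are tiny stand-ins for the real resources, the similarity measure is the one in Python's `difflib` (the source only speaks of "measurable criteria"), and words with no sufficiently close dictionary entry are simply dropped.

```python
import re
from difflib import get_close_matches

# Tiny stand-ins for the real stop-word list and the ~1 million word dictionary.
STOP_WORDS = {"un", "il", "che", "la", "sua", "si", "alle"}
DICTIONARY = [
    "magazzino", "cooperativo", "albero", "magnifico", "rami",
    "splendido", "fuoco", "luce", "classi", "lavoratrici",
]


def validate(text, cutoff=0.8):
    """Reduce OCR output to a list of dictionary-validated keywords."""
    # Recompose the text on a single line (no carriage returns).
    single_line = " ".join(text.split())
    # Remove punctuation and other non-alphanumeric characters.
    cleaned = re.sub(r"[^\w\s]", " ", single_line).lower()
    keywords = []
    for token in cleaned.split():
        # Drop strings shorter than two characters and stop words.
        if len(token) < 2 or token in STOP_WORDS:
            continue
        if token in DICTIONARY:
            keywords.append(token)
            continue
        # Replace an invalid word with the closest dictionary word.
        # Assumption: words with no match above the cutoff are discarded.
        match = get_close_matches(token, DICTIONARY, n=1, cutoff=cutoff)
        if match:
            keywords.append(match[0])
    return keywords


# Fragment adapted from the OCR sample, including the misread word "elassi".
keywords = validate("un albero magnifico;\n alle elassi lavoratrici")
print(keywords)  # ['albero', 'magnifico', 'classi', 'lavoratrici']
```

Here the misrecognized "elassi" is corrected to "classi" because it is the dictionary entry with the highest similarity ratio above the cutoff.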

Final list of keywords = semantic domain of the text fragment

Question answering

Answering a question no longer means looking for a string in a text, but for a concept in a body of knowledge (ontology), according to the context.

The set of all semantic contexts collected from the various text fragments is stored in a database at page-level granularity, and constitutes the ontology on which the answers provided to the user are based.

Questions are forwarded to the database, which uses full-text search to retrieve candidate answers sorted by rank.
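To make the retrieval idea concrete, the sketch below uses a small in-memory inverted index as a stand-in for the database's full-text search. The page identifiers, keywords and scoring rule (one point per matching question term) are assumptions chosen for illustration. A real full-text engine would apply its own ranking function.

```python
from collections import defaultdict

# Hypothetical knowledge base: (document, page) -> validated keywords.
knowledge_base = {
    ("rochdale.pdf", 1): ["magazzino", "cooperativo", "albero", "rami"],
    ("rochdale.pdf", 2): ["operai", "rochdale", "pionieri", "classi", "lavoratrici"],
    ("manual.pdf", 7): ["operai", "sicurezza"],
}


def build_index(pages):
    """Map each keyword to the set of pages whose semantic context contains it."""
    index = defaultdict(set)
    for page_id, keywords in pages.items():
        for keyword in keywords:
            index[keyword].add(page_id)
    return index


def answer(question, index):
    """Return (page, score) pairs, best match first.

    Score = number of question terms found on the page (an assumption;
    this stands in for the ranking of a real full-text search engine).
    """
    scores = defaultdict(int)
    for term in question.lower().split():
        for page_id in index.get(term, ()):
            scores[page_id] += 1
    # Sort by descending score, breaking ties by page identifier.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(knowledge_base)
results = answer("operai pionieri rochdale", index)
print(results)  # [(('rochdale.pdf', 2), 3), (('manual.pdf', 7), 1)]
```

Because every entry carries its page identifier, each ranked answer can be traced back to the page it came from.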
