
Natural language processing

The Datagrok platform has a plugin for natural language processing (NLP) that makes it easier to work with files containing text. To try it out, import your text files according to the instructions or use the provided demo files. The rest of this article covers the main NLP applications integrated into the platform.

Text extraction

It all starts with extracting text, which is a building block for other, more complex tasks. Because text arrives in many file formats, it is essential to support as many of the popular ones as possible. The platform comes with a built-in file browser for easy file management. The NLP package extends it by extracting text from pdf, doc, docx, odt, and other text formats.

Extract text from PDF: in addition to the file preview, the plugin enables the 'Text Extractor' info panel.
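To make the idea of text extraction concrete, here is a minimal sketch for one of the formats listed above, docx, using only the Python standard library. It is an illustration and not the package's actual implementation: the function name `extract_docx_text` is hypothetical, and the sketch assumes a well-formed docx archive whose body lives in `word/document.xml`.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source) -> str:
    """Return the plain text of a .docx file, one line per paragraph.

    `source` is a path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    # A .docx file is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; each <w:t> holds a piece of text.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Formats such as pdf or the legacy doc need dedicated parsers, so a real extractor would dispatch on the file type rather than rely on a single routine like this one.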

Language identification

Determining the language of a document is an important preprocessing step for many language-related tasks. Automatic language detection may be part of applications that perform machine translation or semantic analysis. Datagrok's language identification is powered by Google's Compact Language Detector v3 (CLD3) and supports over 100 languages. Like text extraction, language identification is used in the Translation info panel.
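For readers who want to try CLD3 outside the platform, the following sketch uses the `gcld3` Python bindings. How Datagrok itself calls CLD3 is not described here, so treat this as an assumption-laden example: the helper names `make_detector` and `identify_language` are hypothetical, and the `"und"` fallback for unreliable results is a choice made for this sketch.

```python
def make_detector():
    """Build a CLD3 identifier. Requires the third-party `gcld3` package."""
    import gcld3  # imported lazily so identify_language has no hard dependency

    return gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)


def identify_language(detector, text, fallback="und"):
    """Return (language_code, probability) for `text`.

    Hypothetical helper: if the detector marks its guess as unreliable,
    the fallback code is returned instead of the guessed language.
    """
    result = detector.FindLanguage(text=text)
    if not result.is_reliable:
        return fallback, result.probability
    return result.language, result.probability
```

Usage would look like `identify_language(make_detector(), "Bonjour tout le monde")`. Very short inputs tend to produce unreliable guesses, which is why the sketch surfaces the reliability flag instead of hiding it.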

Neural machine translation

The package creates a new info panel for text files. It uses the AWS Translate service, which supports over 70 languages.

To translate a text, navigate to the file browser and select one of the demo files. Alternatively, open your personal folder and drag and drop your file into the platform. From then on, whenever you click the file, the context panel on the right offers to translate it.

Translate text files

The source language is identified automatically, but you can always change it manually. The default target language is English, so be sure to choose another option if the original text is already in English.
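The same flow (a detected or manually chosen source language, with English as the default target) can be sketched against the translation service directly with `boto3`. This is not the info panel's code: the helper names and the default region are assumptions, and passing `"auto"` delegates language detection to AWS rather than to CLD3.

```python
def make_translate_client(region_name="us-east-1"):
    """Create a translation client. Requires `boto3` and AWS credentials.

    The region is an arbitrary default chosen for this sketch.
    """
    import boto3  # imported lazily so translate() can be tested with a stub

    return boto3.client("translate", region_name=region_name)


def translate(client, text, source="auto", target="en"):
    """Translate `text` and return (translated_text, source_language_code).

    Hypothetical helper: `source="auto"` lets the service detect the
    language; pass an explicit code to override the detection.
    """
    if source == target:
        # Mirrors the advice above: pick another target for English text.
        raise ValueError("source and target languages must differ")
    response = client.translate_text(
        Text=text,
        SourceLanguageCode=source,
        TargetLanguageCode=target,
    )
    return response["TranslatedText"], response["SourceLanguageCode"]
```

Keeping the client as a parameter makes it easy to substitute a stub in tests or to reuse one client across many files.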

Text statistics

Texts are increasingly analyzed for readability. Readability scores take into account parameters such as the average number of words per sentence, the average number of syllables per word, and the percentage of long words.
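As a rough illustration of how such parameters feed a score, the sketch below computes two widely used formulas, Flesch reading ease and Flesch-Kincaid grade level. Choosing these two is an assumption made for the example, and the function names, the regex-based tokenization, and the naive English syllable counter are all simplifications, so expect the numbers to differ from those of a production tool.

```python
import re


def count_syllables(word):
    """Rough English syllable count: vowel groups, minus a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text):
    """Return basic averages and two readability scores for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return {
        "words_per_sentence": words_per_sentence,
        "syllables_per_word": syllables_per_word,
        # Higher means easier to read.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Approximate US school grade needed to understand the text.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }
```

Both scores depend only on the two averages, which is why short sentences built from short words rate as easy to read.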

The Text Statistics info panel calculates two common formulas:

Calculate text statistics


User Meeting 9: Natural Language Processing

See also: