Gentle Start into NLP

Muhammad Ariq Naufal
Published in Analytics Vidhya
4 min read · Jul 25, 2021


Many people talk about NLP, text mining, text processing, and everything around them. So what is NLP?

Photo by Markus Winkler on Unsplash

Natural Language Processing, or NLP, is a subfield of artificial intelligence concerned with making machines able to read, understand, and respond to human language. Because a machine only knows numbers (binary), NLP is what lets it make sense of human language.

But before we jump into NLP, we should know what text mining is and how it differs from NLP.

Text mining is the process of examining large collections of documents to discover new information or help answer specific questions. It focuses on extracting meaningful information from text data, with a limited level of pattern analysis and structure matching. NLP, on the other hand, focuses on how machines can read, interpret, and process human language, both text and speech.

Now that you know the difference between NLP and text mining, we can move on to the steps of text processing. As you may know, text is very unstructured data that contains a lot of irrelevant content, so it must be cleaned first. There are several steps to clean text data.

I have an example text like this:

@F1 Verstappen is guilty! Verstappen still has a lot to learn! 😟😟👎

  • Tokenization

Tokenization is the process of splitting a sentence into words (tokens).

Tokenization Example
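As a minimal sketch, we can split on whitespace in plain Python (real tokenizers, such as NLTK's word_tokenize, also separate punctuation from words):

```python
# Minimal whitespace tokenizer; real tokenizers also split off punctuation
text = "@F1 Verstappen is guilty! Verstappen still has a lot to learn!"
tokens = text.split()
print(tokens)
# ['@F1', 'Verstappen', 'is', 'guilty!', 'Verstappen', 'still', 'has', 'a', 'lot', 'to', 'learn!']
```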
  • Case Folding

Case folding is the process of converting all the text to lowercase.

Case folding example
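In Python this is just lower() applied to each token:

```python
# Lowercase every token so "Verstappen" and "verstappen" count as the same word
tokens = ['@F1', 'Verstappen', 'is', 'guilty!']
lowered = [t.lower() for t in tokens]
print(lowered)  # ['@f1', 'verstappen', 'is', 'guilty!']
```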
  • Removing emoji & punctuation

This is the process of removing emoji and punctuation so the sentence contains only plain words.

removing emoji & punctuation example
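One simple way is a regular expression that keeps only lowercase letters, digits, and spaces (note this sketch also strips the @ from mentions, which may or may not be what you want):

```python
import re

text = "@f1 verstappen is guilty! verstappen still has a lot to learn! 😟😟👎"
cleaned = re.sub(r"[^a-z0-9\s]", "", text)      # drop emoji, punctuation, symbols
cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse leftover spaces
print(cleaned)  # f1 verstappen is guilty verstappen still has a lot to learn
```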
  • Stop word removing

This is the process of removing words that appear frequently but carry no significant meaning, for example: is, has, a, and, the. If stop words are not removed, they may bias the later analysis.

stop word removing example
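A minimal sketch with a tiny hand-made stop word list (NLTK ships a much fuller list per language):

```python
stop_words = {"is", "has", "a", "to", "the", "and"}  # tiny illustrative list
tokens = ["verstappen", "is", "guilty", "verstappen",
          "still", "has", "a", "lot", "to", "learn"]
# Keep only tokens that are not stop words
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['verstappen', 'guilty', 'verstappen', 'still', 'lot', 'learn']
```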
  • Stemming & Lemmatization

These are text normalization techniques that return a word to its root form. Stemming strips affixes using heuristic rules, while lemmatization uses a vocabulary to map a word to its dictionary form (lemma).

stemming & lemmatization example
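A toy suffix-stripping stemmer just to show the idea; real work uses something like NLTK's PorterStemmer or WordNetLemmatizer:

```python
def crude_stem(word):
    """Naive suffix stripping; real stemmers apply ordered rewrite rules."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        # Only strip if a reasonable stem (3+ chars) remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["learning", "cars", "guilty"]])
# ['learn', 'car', 'guilty']
```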

After we make sure the text is clean enough, we can move on to feature engineering. In NLP there are several techniques you can use for this.

  • TF-IDF

Term Frequency-Inverse Document Frequency is a measure of how important a word is to a document in a corpus. It is based on the number of times the word appears in the document, offset by how frequently the word appears across the whole corpus.

The standard weighting is: tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is how often term t appears in document d, N is the total number of documents, and df(t) is the number of documents that contain t.
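This weighting can be computed directly in plain Python over a hypothetical three-document toy corpus (scikit-learn's TfidfVectorizer does this, with extra smoothing, in practice):

```python
import math

docs = [
    ["verstappen", "is", "guilty"],
    ["verstappen", "still", "has", "a", "lot", "to", "learn"],
    ["hamilton", "wins", "the", "race"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)         # term frequency in this document
    df = sum(1 for d in docs if term in d)  # documents containing the term
    idf = math.log(len(docs) / df)          # rarer across docs -> higher idf
    return tf * idf

# "guilty" is rarer across the corpus than "verstappen", so it scores higher
print(round(tf_idf("guilty", docs[0], docs), 3))      # 0.366
print(round(tf_idf("verstappen", docs[0], docs), 3))  # 0.135
```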

  • BoW

Bag-of-Words is a technique that represents a document by the occurrence counts of its words, ignoring grammar and word order.

Weakness of BoW:

  1. Discarding word order ignores context, and in turn the meaning of words in the document
  2. In a very large corpus, the vectors will contain many zero scores (a sparse matrix)
  3. Highly frequent words dominate the document but may not carry much information
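A minimal BoW sketch over a hypothetical two-document corpus (in practice scikit-learn's CountVectorizer builds these vectors for you):

```python
from collections import Counter

docs = ["verstappen is guilty", "verstappen still has a lot to learn"]
vocab = sorted({w for d in docs for w in d.split()})  # shared sorted vocabulary
# One vector per document: the count of each vocabulary word
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vocab)
print(vectors)  # note how many entries are zero even for two short documents
```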
  • Word Embedding

Word embedding represents words as real-valued vectors in a vector space, such that words with similar meanings have similar representations.

Word Embedding Example
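A sketch with made-up 3-dimensional vectors; real embeddings (word2vec, GloVe) are learned from large corpora and have hundreds of dimensions:

```python
import math

# Hypothetical toy vectors, hand-made for illustration only
embeddings = {
    "king":  [0.9, 0.80, 0.10],
    "queen": [0.9, 0.75, 0.20],
    "car":   [0.1, 0.20, 0.95],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# Similar meanings -> similar vectors -> higher cosine similarity
print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["car"]))    # much lower
```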
  • Named Entity Recognition

Named Entity Recognition (NER) is an algorithm for identifying named entities in a text and classifying them into predefined categories.

Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more.

NER is mostly used to understand the subject or theme of a body of text and to quickly group texts based on their relevancy or similarity.

NER Example
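The simplest possible sketch of the idea is a dictionary (gazetteer) lookup; production NER uses trained statistical models such as spaCy's. All the names and categories below are hypothetical examples:

```python
# Hypothetical gazetteer: surface form -> entity category
gazetteer = {
    "verstappen": "PERSON",
    "red bull": "ORGANIZATION",
    "monaco": "LOCATION",
}

def tag_entities(text):
    """Return (entity, category) pairs for every gazetteer entry in the text."""
    lowered = text.lower()
    return [(name, label) for name, label in gazetteer.items() if name in lowered]

print(tag_entities("Verstappen wins in Monaco for Red Bull"))
```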
  • Part of Speech — Tagging

Part-of-speech tagging is the process of assigning a tag (grammatical category) to each word in a text or corpus. POS tagging is essential for building lemmatizers, which are used to reduce a word to its root form.

Example of POS Tagging

There are 4 techniques for tagging:

  1. Lexical Based: Assigns the POS tag the word most frequently has in the training corpus
  2. Rule Based: Assigns POS tags based on hand-written rules
  3. Probabilistic: Assigns POS tags based on the probability of a particular tag sequence occurring
  4. Deep Learning: Uses neural networks to assign POS tags
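The lexical and rule-based approaches can be sketched together with a tiny lexicon plus suffix rules (the lexicon and rules here are assumed for illustration; real taggers learn them from annotated corpora):

```python
# Tiny hand-made lexicon; unknown words fall back to suffix rules
lexicon = {"the": "DET", "a": "DET", "is": "VERB", "has": "VERB"}

def tag(word):
    if word in lexicon:
        return lexicon[word]        # lexical lookup
    if word.endswith(("ing", "ed")):
        return "VERB"               # rule: verb-like suffixes
    if word.endswith("ly"):
        return "ADV"                # rule: adverb suffix
    return "NOUN"                   # default tag for unknown words

sentence = "the driver is learning quickly".split()
print([(w, tag(w)) for w in sentence])
# [('the', 'DET'), ('driver', 'NOUN'), ('is', 'VERB'), ('learning', 'VERB'), ('quickly', 'ADV')]
```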

Thank you for reading my article, hope it helps you!

Photo by Cintya Marisa on Unsplash
