Gentle Start into NLP

Muhammad Ariq Naufal
Published in Analytics Vidhya
4 min read · Jul 25, 2021


Many people talk about NLP, text mining, text processing, and everything around them. So what is NLP?

Photo by Markus Winkler on Unsplash

Natural Language Processing, or NLP, is a subfield of artificial intelligence concerned with making machines able to read, understand, and respond to human language. Because a machine only knows numbers (binary), NLP is what lets it make sense of human language.

But before we jump into NLP, we should know what text mining is and how it differs from NLP.

Text mining is the process of examining large collections of documents to discover new information or help answer specific questions. It focuses on extracting meaningful information from text data, with a limited level of pattern analysis and structure matching. NLP, on the other hand, focuses on how machines can read, interpret, and process human language, both text and speech.

Now that you know the difference between NLP and text mining, we can move on to the steps of text processing. As you may know, text is very unstructured data that contains a lot of irrelevant content, so it must be cleaned first. There are several steps to clean text data.

I have an example text like this:

@F1 Verstappen is guilty! Verstappen still has a lot to learn! 😟😟👎

  • Tokenization

Tokenization is the process of splitting a sentence into words (tokens).

Tokenization Example
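As a minimal sketch, we can split on whitespace in plain Python (real tokenizers, such as NLTK's word_tokenize, also separate punctuation from words):

```python
# Minimal whitespace tokenizer; real tokenizers also split off punctuation
text = "@F1 Verstappen is guilty! Verstappen still has a lot to learn!"
tokens = text.split()
print(tokens)
# ['@F1', 'Verstappen', 'is', 'guilty!', 'Verstappen', 'still', 'has', 'a', 'lot', 'to', 'learn!']
```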
  • Case Folding

Case folding is the process of converting all the text to lowercase.

Case folding example
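In Python this is just lower() applied to each token:

```python
# Lowercase every token so "Verstappen" and "verstappen" count as the same word
tokens = ['@F1', 'Verstappen', 'is', 'guilty!']
lowered = [t.lower() for t in tokens]
print(lowered)  # ['@f1', 'verstappen', 'is', 'guilty!']
```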
  • Removing emoji & punctuation

This is the process of removing emoji and punctuation so the sentence contains only plain words.

removing emoji & punctuation example
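One simple way is a regular expression that keeps only lowercase letters, digits, and spaces (note this sketch also strips the @ from mentions, which may or may not be what you want):

```python
import re

text = "@f1 verstappen is guilty! verstappen still has a lot to learn! 😟😟👎"
cleaned = re.sub(r"[^a-z0-9\s]", "", text)      # drop emoji, punctuation, symbols
cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse leftover spaces
print(cleaned)  # f1 verstappen is guilty verstappen still has a lot to learn
```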
  • Stop word removing

This is the process of removing words that appear frequently but carry no significant meaning, for example: is, has, a, and, the. If stop words are not removed, they may bias the later analysis.

stop word removing example
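A minimal sketch with a tiny hand-made stop word list (NLTK ships a much fuller list per language):

```python
stop_words = {"is", "has", "a", "to", "the", "and"}  # tiny illustrative list
tokens = ["verstappen", "is", "guilty", "verstappen",
          "still", "has", "a", "lot", "to", "learn"]
# Keep only tokens that are not stop words
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['verstappen', 'guilty', 'verstappen', 'still', 'lot', 'learn']
```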
  • Stemming & Lemmatization

These are text normalization techniques that return a word to its root form. Stemming strips affixes using heuristic rules, while lemmatization uses a vocabulary to map a word to its dictionary form (lemma).

stemming & lemmatization example
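A toy suffix-stripping stemmer just to show the idea; real work uses something like NLTK's PorterStemmer or WordNetLemmatizer:

```python
def crude_stem(word):
    """Naive suffix stripping; real stemmers apply ordered rewrite rules."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        # Only strip if a reasonable stem (3+ chars) remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["learning", "cars", "guilty"]])
# ['learn', 'car', 'guilty']
```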

After we make sure the text is clean enough, we can move on to feature engineering. In NLP there are several techniques you can use for this.

  • TF-IDF

Term Frequency-Inverse Document Frequency is a measure of how important a word is to a document in a corpus. It is based on the number of times the word appears in the document, offset by how frequently the word appears across the whole corpus.

The standard weighting is: tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is how often term t appears in document d, N is the total number of documents, and df(t) is the number of documents that contain t.
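This weighting can be computed directly in plain Python over a hypothetical three-document toy corpus (scikit-learn's TfidfVectorizer does this, with extra smoothing, in practice):

```python
import math

docs = [
    ["verstappen", "is", "guilty"],
    ["verstappen", "still", "has", "a", "lot", "to", "learn"],
    ["hamilton", "wins", "the", "race"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)         # term frequency in this document
    df = sum(1 for d in docs if term in d)  # documents containing the term
    idf = math.log(len(docs) / df)          # rarer across docs -> higher idf
    return tf * idf

# "guilty" is rarer across the corpus than "verstappen", so it scores higher
print(round(tf_idf("guilty", docs[0], docs), 3))      # 0.366
print(round(tf_idf("verstappen", docs[0], docs), 3))  # 0.135
```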

  • BoW

Bag-of-Words is a technique that represents a document by the occurrence counts of its words, ignoring grammar and word order.

Weakness of BoW:

  1. Discarding word order ignores context, and in turn the meaning of words in the document
  2. In a very large corpus, the vectors will contain many zero scores (a sparse matrix)
  3. Highly frequent words dominate the document but may not carry much information
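A minimal BoW sketch over a hypothetical two-document corpus (in practice scikit-learn's CountVectorizer builds these vectors for you):

```python
from collections import Counter

docs = ["verstappen is guilty", "verstappen still has a lot to learn"]
vocab = sorted({w for d in docs for w in d.split()})  # shared sorted vocabulary
# One vector per document: the count of each vocabulary word
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vocab)
print(vectors)  # note how many entries are zero even for two short documents
```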
  • Word Embedding

Word embedding represents words as real-valued vectors in a vector space, such that words with similar meanings have similar representations.

Word Embedding Example
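A sketch with made-up 3-dimensional vectors; real embeddings (word2vec, GloVe) are learned from large corpora and have hundreds of dimensions:

```python
import math

# Hypothetical toy vectors, hand-made for illustration only
embeddings = {
    "king":  [0.9, 0.80, 0.10],
    "queen": [0.9, 0.75, 0.20],
    "car":   [0.1, 0.20, 0.95],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# Similar meanings -> similar vectors -> higher cosine similarity
print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["car"]))    # much lower
```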
  • Named Entity Recognition

Named Entity Recognition (NER) is an algorithm for identifying named entities in a text and classifying them into predefined categories.

Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more.

NER is mostly used to understand the subject or theme of a body of text and to quickly group texts based on their relevancy or similarity.

NER Example
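The simplest possible sketch of the idea is a dictionary (gazetteer) lookup; production NER uses trained statistical models such as spaCy's. All the names and categories below are hypothetical examples:

```python
# Hypothetical gazetteer: surface form -> entity category
gazetteer = {
    "verstappen": "PERSON",
    "red bull": "ORGANIZATION",
    "monaco": "LOCATION",
}

def tag_entities(text):
    """Return (entity, category) pairs for every gazetteer entry in the text."""
    lowered = text.lower()
    return [(name, label) for name, label in gazetteer.items() if name in lowered]

print(tag_entities("Verstappen wins in Monaco for Red Bull"))
```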
  • Part of Speech — Tagging

Part-of-speech tagging is the process of assigning a tag (grammatical category) to each word in a text or corpus. POS tagging is essential for building lemmatizers, which are used to reduce a word to its root form.

Example of POS Tagging

There are 4 techniques for tagging:

  1. Lexical Based: Assigns the POS tag the word most frequently has in the training corpus
  2. Rule Based: Assigns POS tags based on hand-written rules
  3. Probabilistic: Assigns POS tags based on the probability of a particular tag sequence occurring
  4. Deep Learning: Uses neural networks to assign POS tags
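The lexical and rule-based approaches can be sketched together with a tiny lexicon plus suffix rules (the lexicon and rules here are assumed for illustration; real taggers learn them from annotated corpora):

```python
# Tiny hand-made lexicon; unknown words fall back to suffix rules
lexicon = {"the": "DET", "a": "DET", "is": "VERB", "has": "VERB"}

def tag(word):
    if word in lexicon:
        return lexicon[word]        # lexical lookup
    if word.endswith(("ing", "ed")):
        return "VERB"               # rule: verb-like suffixes
    if word.endswith("ly"):
        return "ADV"                # rule: adverb suffix
    return "NOUN"                   # default tag for unknown words

sentence = "the driver is learning quickly".split()
print([(w, tag(w)) for w in sentence])
# [('the', 'DET'), ('driver', 'NOUN'), ('is', 'VERB'), ('learning', 'VERB'), ('quickly', 'ADV')]
```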

Thank you for reading my article, hope it helps you!

Photo by Cintya Marisa on Unsplash
