{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"_uuid": "82a3f11267d24990d6ef9323c5c97b4f9af60f11"
},
"source": [
"# IMDB Sentiments\n",
"\n",
"## Introduction\n",
"\n",
"This notebook follows the Text Classification guide from Google Machine Learning Guides.
\n",
"This notebook contains all the code that the guide shows in the tutorial and not in its github repo. Hope this guide helps you as you follow the Text Classification guide.\n",
"\n",
"Link to the Guide: https://developers.google.com/machine-learning/guides/text-classification/\n",
"\n",
"In this notebook, we see how to perform sentiment analysis using IMDB Movie Reviews Dataset. We will classify reviews into `2` labels: _positive(`1`)_ and _negetive(`0`)_. And we will encode the data using tf-idf and feed into a Multi-layer Perceptron. We will use tensorflow, with Keras API."
]
},
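{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before diving in, here is a minimal, self-contained sketch (not part of the guide's code) of the tf-idf encoding step described above, run on a tiny made-up corpus. The `toy_corpus` texts are invented purely for illustration.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hedged sketch of tf-idf encoding on a toy corpus (illustration only;\n",
"# the real notebook applies the same idea to the IMDB reviews).\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"toy_corpus = ['a great movie', 'a terrible movie', 'great acting and a great story']\n",
"toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # uni-grams and bi-grams\n",
"toy_matrix = toy_vectorizer.fit_transform(toy_corpus)\n",
"\n",
"# Each row is one document; each column is one n-gram in the learned vocabulary.\n",
"print(toy_matrix.shape)"
]
},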
{
"cell_type": "markdown",
"metadata": {
"_uuid": "2b9552fa0ddd512113e41bb41badf7bf86389a4e"
},
"source": [
"## Loading the required modules\n",
"\n",
"Let;s get started by loading all the required modules and defining all the constants and variables that we will be needing all throughout the notebook"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"_uuid": "3cd8836844c84ba8afe2b4724418b68fb1298b27"
},
"outputs": [],
"source": [
"import os\n",
"import numpy as np\n",
"import tensorflow as tf\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.feature_selection import SelectKBest\n",
"from sklearn.feature_selection import f_classif\n",
"\n",
"from tensorflow.python.keras import models\n",
"from tensorflow.python.keras.layers import Dense\n",
"from tensorflow.python.keras.layers import Dropout\n",
"\n",
"path = 'data/'"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "da2eb32b4ececc890562532b3ef8ba0825414caf"
},
"source": [
"## Load the Dataset\n",
"\n",
"In this section, let's load the dataset and shuffle it so to make ready for analysis."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"_uuid": "d8b776d6dc48490a553a768ae0f8f358f23ab20a"
},
"outputs": [],
"source": [
"import os\n",
"path='C:/Users/Zaid Naeem/Documents/final Anaconda/data/aclImdb/'\n",
"\n",
"#path = 'C:\\Users\\Zaid Naeem\\Documents\\final Anaconda\\data\\aclImdb'\n",
"positiveFiles = [x for x in os.listdir(path+\"train/pos/\") if x.endswith(\".txt\")]\n",
"negativeFiles = [x for x in os.listdir(path+\"train/neg/\") if x.endswith(\".txt\")]\n",
"testFiles = [x for x in os.listdir(path+\"test/\") if x.endswith(\".txt\")]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"_uuid": "b71385f428efe17e820f6f14da6e3f4cc4d78928"
},
"outputs": [],
"source": [
"positiveReviews, negativeReviews, testReviews = [], [], []\n",
"for pfile in positiveFiles:\n",
" with open(path+\"train/pos/\"+pfile, encoding=\"latin1\") as f:\n",
" positiveReviews.append(f.read())\n",
"for nfile in negativeFiles:\n",
" with open(path+\"train/neg/\"+nfile, encoding=\"latin1\") as f:\n",
" negativeReviews.append(f.read())\n",
"for tfile in testFiles:\n",
" with open(path+\"test/\"+tfile, encoding=\"latin1\") as f:\n",
" testReviews.append(f.read())"
]
},
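{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (assuming the standard aclImdb layout, which ships 12,500 reviews per training class), let's confirm how many reviews were actually read from disk:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check on the number of reviews loaded in the previous cell.\n",
"print('positive train reviews:', len(positiveReviews))\n",
"print('negative train reviews:', len(negativeReviews))\n",
"print('test reviews:', len(testReviews))"
]
},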
{
"cell_type": "markdown",
"metadata": {
"_uuid": "fa744957e3aa9a8d007538e59cdfff7bc94f498e"
},
"source": [
"Now, lets load the dataset and perform some analysis on the dataset!"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"_uuid": "b1ff87c4c3848bfe182fefeb2e5a52e389e2f89f"
},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | review | \n", "label | \n", "file | \n", "
---|---|---|---|
21939 | \n", "Gwyneth Paltrow is absolutely great in this mo... | \n", "0 | \n", "7246_4.txt | \n", "
24113 | \n", "I own this movie. Not by choice, I do. I was r... | \n", "0 | \n", "9202_1.txt | \n", "
4633 | \n", "Well I guess it supposedly not a classic becau... | \n", "1 | \n", "2920_8.txt | \n", "
17240 | \n", "I am, as many are, a fan of Tony Scott films. ... | \n", "0 | \n", "3016_1.txt | \n", "
4894 | \n", "I wish \"that '70s show\" would come back on tel... | \n", "1 | \n", "3155_10.txt | \n", "