{ "cells": [ { "cell_type": "markdown", "metadata": { "_uuid": "2b9552fa0ddd512113e41bb41badf7bf86389a4e" }, "source": [ "## Loading the required modules\n", "\n", "Let's get started by loading all the required modules and defining the constants and variables that we will need throughout the notebook." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "_uuid": "3cd8836844c84ba8afe2b4724418b68fb1298b27" }, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import tensorflow as tf\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.feature_selection import SelectKBest\n", "from sklearn.feature_selection import f_classif\n", "\n", "from tensorflow.python.keras import models\n", "from tensorflow.python.keras.layers import Dense\n", "from tensorflow.python.keras.layers import Dropout\n", "\n", "path = 'data/'" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "da2eb32b4ececc890562532b3ef8ba0825414caf" }, "source": [ "## Load the Dataset\n", "\n", "In this section, let's load the dataset and shuffle it to make it ready for analysis." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "_uuid": "d8b776d6dc48490a553a768ae0f8f358f23ab20a" }, "outputs": [], "source": [ "# Point this at your local copy of the aclImdb dataset.\n", "path = 'C:/Users/Zaid Naeem/Documents/final Anaconda/data/aclImdb/'\n", "\n", "positiveFiles = [x for x in os.listdir(path+\"train/pos/\") if x.endswith(\".txt\")]\n", "negativeFiles = [x for x in os.listdir(path+\"train/neg/\") if x.endswith(\".txt\")]\n", "testFiles = [x for x in os.listdir(path+\"test/\") if x.endswith(\".txt\")]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "_uuid": "b71385f428efe17e820f6f14da6e3f4cc4d78928" }, "outputs": [], "source": [ "# Read every review file into memory.\n", "positiveReviews, negativeReviews, testReviews = [], [], []\n", "for pfile in positiveFiles:\n", "    with open(path+\"train/pos/\"+pfile, encoding=\"latin1\") as f:\n", "        positiveReviews.append(f.read())\n", "for nfile in negativeFiles:\n", "    with open(path+\"train/neg/\"+nfile, encoding=\"latin1\") as f:\n", "        negativeReviews.append(f.read())\n", "for tfile in testFiles:\n", "    with open(path+\"test/\"+tfile, encoding=\"latin1\") as f:\n", "        testReviews.append(f.read())" ] },
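{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick optional sanity check, let's count how many reviews were read into each list. For the standard aclImdb training split this should show 12,500 positive and 12,500 negative reviews; the test count depends on how the `test/` folder is laid out." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: confirm how many reviews were loaded from each folder.\n", "print('Positive training reviews:', len(positiveReviews))\n", "print('Negative training reviews:', len(negativeReviews))\n", "print('Test reviews:', len(testReviews))" ] },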
{ "cell_type": "markdown", "metadata": { "_uuid": "fa744957e3aa9a8d007538e59cdfff7bc94f498e" }, "source": [ "Now, let's organize the reviews into a labelled DataFrame and take a quick look at a few sample rows." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "_uuid": "b1ff87c4c3848bfe182fefeb2e5a52e389e2f89f" }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>review</th><th>label</th><th>file</th></tr>\n", "  </thead>\n", "  <tbody>\n", "
    <tr><th>21939</th><td>Gwyneth Paltrow is absolutely great in this mo...</td><td>0</td><td>7246_4.txt</td></tr>\n", "
    <tr><th>24113</th><td>I own this movie. Not by choice, I do. I was r...</td><td>0</td><td>9202_1.txt</td></tr>\n", "
    <tr><th>4633</th><td>Well I guess it supposedly not a classic becau...</td><td>1</td><td>2920_8.txt</td></tr>\n", "
    <tr><th>17240</th><td>I am, as many are, a fan of Tony Scott films. ...</td><td>0</td><td>3016_1.txt</td></tr>\n", "
    <tr><th>4894</th><td>I wish \"that '70s show\" would come back on tel...</td><td>1</td><td>3155_10.txt</td></tr>\n", "