A Guide to Reading a Word File Using Python

Introduction

This tutorial shows how to read a Word file using Python. Word files are widely used for documentation. The tutorial also shows how to install the docx and nltk modules on the Windows operating system. The example script needs both: docx reads the Word (docx) file and nltk splits its text into sentences.

Benefits of Word Document

Microsoft Word is used in many areas, some of which are given below:

  • You can create all types of official documents in Microsoft Word.
  • You can create lecture scripts using text, word art, shapes, colors, and images.
  • You can create birthday cards and invitation cards in Microsoft Word using pre-defined templates or the functions of the Insert and Format menus.
  • You can highlight basic and advanced knowledge of MS Word as a skill in your resume for a job interview.
  • You can create notes and assignments in MS Word.
  • You can create and print a book using MS Word by creating a cover page, content, headers and footers, image adjustments, text alignment, text highlighting, etc.
  • Whether you run your business online or offline, you need to create documents for official work.
  • You can use Microsoft Word to collaborate with your team while working on the same project and document.
  • What’s more, this software is widely used in many different application fields all over the world, including data science.

You might have seen various operations on Word files using the Apache POI API in Java, which requires more lines of code to read from or write to Word files.

Reading a Word file using Python, by contrast, takes only a few lines of code. I will use a sample Word file here to demonstrate.

You can also find a sample Word file through a Google search and give it a try.

Let’s move on to the example…

Prerequisites

Python 3.8.0 – 3.9.1, Packages – python-docx (imported as docx), nltk

Preparing Workspace

Preparing your workspace is one of the first things you can do to make sure that you start off well. The first step is to check your working directory.

When you work in the Python terminal, first navigate to the directory where your file is located and then start Python. In other words, make sure that your file is located in the directory you want to work from.
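The directory check described above can also be done from inside Python. The following is a minimal sketch; the helper name list_docx_files is hypothetical and not part of the original tutorial.

```python
from pathlib import Path


def list_docx_files(directory="."):
    """Return the sorted names of all .docx files found directly in `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.docx"))


if __name__ == "__main__":
    # Show where Python is currently running and which Word files it can see there.
    print("Working directory:", Path.cwd())
    print("Word files found:", list_docx_files())
```

If the Word file you want to read is not in the printed list, Python was probably started from a different directory.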

Installing Modules

Check for the docx and nltk modules in the Python terminal by typing the commands shown below. If you do not get any error message, the modules exist; otherwise you have to install the missing module.

import docx
import nltk
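The same check can be done without triggering an error message. This is a small sketch using the standard library's importlib; the function name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["docx", "nltk"])
    if missing:
        print("Please install:", ", ".join(missing))
    else:
        print("docx and nltk are both available.")
```

Note that find_spec only reports whether a module can be located; it does not verify that the import itself succeeds.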

If the docx and nltk modules are not available, follow the steps below to install them on the Windows operating system.

Please make sure you open the cmd prompt in administrator mode.

Installing Module – docx

Execute the command pip install python-docx in the cmd prompt to install the docx module. This command installs the latest version of the module; this tutorial uses python-docx version 0.8.10.
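To confirm which version pip actually installed, the version can be queried from Python. This is a sketch based on the standard library's importlib.metadata (available from Python 3.8); the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a pip distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "python-docx" is the name used with pip; the module itself is imported as "docx".
    print("python-docx:", installed_version("python-docx"))
```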

Installing Module – nltk

Now you will see how to install the nltk module.

Execute the command pip install nltk to install it. Make sure you open the cmd prompt in administrator mode.

Installing nltk as shown above is not enough; you also need to download the required packages. To do so, run import nltk followed by nltk.download() in the Python window.

A popup window then opens for downloading the required packages.

Once the required packages are downloaded, the popup window indicates that they are installed.

You are done installing nltk.
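If the popup downloader is inconvenient, for example on a machine without a display, the tokenizer data can also be fetched from code. The sketch below is not part of the original tutorial. It assumes that sentence tokenizing needs the "punkt" resource on older NLTK releases and "punkt_tab" on newer ones, so it tries to make both available; the function name ensure_tokenizer_data is hypothetical.

```python
# Assumption: which resource is needed depends on the installed NLTK version,
# so this sketch checks for both names.
TOKENIZER_RESOURCES = ("punkt", "punkt_tab")


def ensure_tokenizer_data(resources=TOKENIZER_RESOURCES):
    """Download any missing sentence-tokenizer data; return the names that were fetched."""
    import nltk  # imported lazily so this file can be loaded before nltk is installed

    fetched = []
    for name in resources:
        try:
            # Raises LookupError when the resource is not present locally.
            nltk.data.find(f"tokenizers/{name}")
        except LookupError:
            # quiet=True suppresses progress output; download() returns True on success.
            if nltk.download(name, quiet=True):
                fetched.append(name)
    return fetched
```

Calling ensure_tokenizer_data() once before running the reading script should be enough; it needs an internet connection the first time.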

Reading Word File

Now let’s move on to the example of reading a Word file using Python.

I opened a cmd prompt and navigated to the directory that contains the Word file to be read.

I will read the whole content of the Word file and display it in the Python console. You can equally read the content and process it further for your own business needs.

The Word file, named sample.docx, should be put into the C:\py_scripts directory, where I will also put the Python script that reads it.

Now create a Python script read_word.py under C:\py_scripts for reading the Word file. Here py is the extension of the Python file.

In the Python script below, notice how the docx and nltk modules are imported. The script shows how to read a Word file using Python.

import docx
from nltk.tokenize import sent_tokenize


# Extract text from a DOCX file
def getDocxContent(filename):
    doc = docx.Document(filename)
    # Join paragraphs with newlines so that the last sentence of one
    # paragraph does not run into the first sentence of the next
    return "\n".join(para.text for para in doc.paragraphs)


resume = getDocxContent("sample.docx")

# Split the text into sentences with NLTK and print each one
for sentence in sent_tokenize(resume):
    print(sentence)
    print("\n")
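For readers curious about what happens under the hood: a .docx file is a ZIP archive whose main body is stored in word/document.xml. The following standard-library sketch extracts paragraph text without python-docx. It is an illustration only, not the method used in this tutorial, and the function name get_docx_text is hypothetical. Unlike doc.paragraphs, it also picks up paragraphs nested inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_docx_text(filename):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(filename) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

The returned string can be passed to sent_tokenize in the same way as the output of getDocxContent.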

Testing the Script

When you execute the above Python script, each sentence of the Word file is printed to the console, separated by blank lines.

I hope you now understand how to read a Word file using Python.
