Python Read Data From a Complex Text File Using Template

Introduction

If you have been part of the data science (or any data!) industry, you know the challenge of working with different data types. Different formats, different compression, different parsing on different systems: you could quickly end up pulling your hair out! And I have not even mentioned unstructured or semi-structured data yet.

For any data scientist or data engineer, dealing with different formats can become a tiresome job. In the real world, people rarely get neat tabular data. Thus, it is mandatory for any data scientist (or data engineer) to be aware of the different file formats, the common challenges in handling them and the best / most efficient ways to handle this data in real life.

This article covers the common formats a data scientist or a data engineer must be aware of. I will first introduce you to the different common file formats used in the industry. Later, we'll see how to read these file formats in Python.

P.S. In the rest of this article, I will be referring to a data scientist, but the same applies to a data engineer or any data science professional.

Table of Contents

  1. What is a file format?
  2. Why should a data scientist understand different file formats?
  3. Different file formats and how to read them in Python
    1. Comma-separated values
    2. XLSX
    3. ZIP
    4. Plain Text (txt)
    5. JSON
    6. XML
    7. HTML
    8. Images
    9. Hierarchical Data Format
    10. PDF
    11. DOCX
    12. MP3
    13. MP4

1. What is a file format?

A file format is a standard way in which information is encoded for storage in a file. First, the file format specifies whether the file is a binary or ASCII file. Second, it shows how the information is organized. For example, the comma-separated values (CSV) file format stores tabular data in plain text.

To identify a file format, you can usually look at the file extension to get an idea. For example, a file saved with the name "Data" in "CSV" format will appear as "Data.csv". By noticing the ".csv" extension we can clearly identify that it is a "CSV" file and that the data is stored in a tabular format.
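A minimal sketch of the extension check described above, using only the standard library. The helper name guess_format and the sample file names are hypothetical. Keep in mind that an extension is only a hint, since a file can be renamed without changing its contents.

```python
from pathlib import Path

def guess_format(filename):
    """Return the lower-case extension without the dot, or '' if there is none."""
    return Path(filename).suffix.lower().lstrip(".")

# Hypothetical file names, used only for illustration.
for name in ["Data.csv", "report.XLSX", "notes"]:
    print(name, "->", repr(guess_format(name)))
```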

2. Why should a data scientist understand different file formats?

Usually, the files you will come across will depend on the application you are building. For example, in an image processing system, you need image files as input and output. So you will mostly see files in jpeg, gif or png format.

As a data scientist, you need to understand the underlying structure of various file formats, their advantages and disadvantages. Unless you understand the underlying structure of the data, you will not be able to explore it. Also, at times you need to make decisions about how to store data.

Choosing the optimal file format for storing data can improve the performance of your models in data processing.

Now, we will look at the following file formats and how to read them in Python:

  • Comma-separated values
  • XLSX
  • ZIP
  • Plain Text (txt)
  • JSON
  • XML
  • HTML
  • Images
  • Hierarchical Data Format
  • PDF
  • DOCX
  • MP3
  • MP4

3. Different file formats and how to read them in Python

3.1 Comma-separated values

Comma-separated values file format falls under spreadsheet file format.

What is Spreadsheet File Format?

In a spreadsheet file format, data is stored in cells. The cells are organized in rows and columns. A column in the spreadsheet file can have different types. For example, a column can be of string type, a date type or an integer type. Some of the most popular spreadsheet file formats are Comma Separated Values ( CSV ), Microsoft Excel Spreadsheet ( xls ) and Microsoft Excel Open XML Spreadsheet ( xlsx ).

Each line in a CSV file represents an observation, usually called a record. Each record may contain one or more fields which are separated by a comma.

Sometimes you may come across files where fields are not separated by a comma but by a tab. This file format is known as the TSV (Tab Separated Values) file format.

The beneath image shows a CSV file which is opened in Notepad.

Reading the data from CSV in Python

Let us look at how to read a CSV file in Python. For loading the data you can use the "pandas" library in Python.

import pandas as pd

df = pd.read_csv("/home/Loan_Prediction/train.csv")

The above code will load the train.csv file into the DataFrame df.
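If pandas is not available, the record-and-field structure of CSV and TSV files can also be explored with Python's built-in csv module. The sketch below is only an illustration: the helper name read_delimited and the sample columns are hypothetical, and the data is held in a string so that no file on disk is needed.

```python
import csv
import io

def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# Hypothetical sample data: the same two records as CSV and as TSV.
csv_text = "Loan_ID,Gender,LoanAmount\nLP001,Male,128\nLP002,Female,66\n"
tsv_text = csv_text.replace(",", "\t")

csv_rows = read_delimited(csv_text)
tsv_rows = read_delimited(tsv_text, delimiter="\t")

print(csv_rows[0]["LoanAmount"])  # every field is read as a string
print(csv_rows == tsv_rows)       # same records, only the separator differs
```

With pandas, the equivalent switch is the sep argument of read_csv, for example sep="\t" for a TSV file.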

3.2 XLSX files

XLSX is a Microsoft Excel Open XML file format. It also comes under the spreadsheet file formats. It is an XML-based file format created by Microsoft Excel. The XLSX format was introduced with Microsoft Office 2007.

In XLSX, data is organized under the cells and columns in a sheet. Each XLSX file may contain one or more sheets. So a workbook can contain multiple sheets.

The below image shows an "xlsx" file which is opened in Microsoft Excel.

In the above image, you can see that there are multiple sheets present (bottom left) in this file, which are Customers, Employees, Invoice, Order. The image shows the data of only one sheet, "Invoice".

Reading the data from XLSX file

Let's load the data from an XLSX file and define the sheet name. For loading the data you can use the pandas library in Python.

import pandas as pd

df = pd.read_excel("/home/Loan_Prediction/train.xlsx", sheet_name = "Invoice")

The above code will load the sheet "Invoice" from the "train.xlsx" file into the DataFrame df.

3.3 ZIP files

ZIP format is an archive file format.

What is Archive File format?

In an archive file format, you create a file that contains multiple files along with metadata. An archive file format is used to collect multiple data files together into a single file. This is often done simply to compress the files so that they use less storage space.

There are many popular computer data archive formats for creating archive files. Zip, RAR and Tar are the most popular archive file formats for compressing data.

ZIP is a lossless compression format, which means that if you compress multiple files using the ZIP format you can fully recover the data after decompressing the ZIP file. The ZIP file format supports several compression algorithms for compressing the documents. You can easily identify a ZIP file by the .zip extension.

Reading a .ZIP file in Python

You can read a zip file by importing the "zipfile" package. Below is the Python code which can read the "train.csv" file that is inside "T.zip".

import zipfile

archive = zipfile.ZipFile('T.zip', 'r')
# read() returns the raw bytes of train.csv, not a DataFrame
df = archive.read('train.csv')
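To see the lossless round trip in action without needing a real T.zip, here is a small self-contained sketch. It builds an archive in memory, reads the member back and parses it. The sample rows are made up for illustration. Note that read() returns bytes, so they are decoded before being parsed as CSV.

```python
import csv
import io
import zipfile

# Build a small archive in memory so the example does not depend on T.zip existing.
original = b"id,amount\n1,100\n2,250\n"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("train.csv", original)

# Reopen the archive, list its members, and read one member back.
with zipfile.ZipFile(buffer, "r") as archive:
    names = archive.namelist()
    restored = archive.read("train.csv")  # bytes, exactly as stored

# Decode the bytes and parse them as CSV rows.
rows = list(csv.reader(io.StringIO(restored.decode("utf-8"))))

print(names)
print(restored == original)  # lossless: the bytes come back unchanged
print(rows)
```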

Here, I have discussed one of the most famous archive formats and how to open it in Python. I am not covering other archive formats. If you want to read about different archive formats and their comparisons you can refer to this link.

3.4 Plain Text (txt) file format

In the Plain Text file format, everything is written in plain text. Usually, this text is in unstructured form and there is no metadata associated with it. The txt file format can easily be read by any program. But interpreting it is very difficult for a computer program.

Let's take a simple example of a text file.

The following example shows text file data that contains text:

"In my previous article, I introduced you to the basics of Apache Spark, different data representations (RDD / DataFrame / Dataset) and basics of operations (Transformation and Action). We even solved a machine learning problem from one of our past hackathons. In this article, I will continue from the place I left in my previous article. I will focus on manipulating RDD in PySpark by applying operations (Transformation and Actions)."

Suppose the above text is written in a file called text.txt and you want to read it; you can refer to the below code.

text_file = open("text.txt", "r")
lines = text_file.read()
text_file.close()  # release the file handle once the text is read
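A slightly more defensive variant is sketched below. It uses a with block so the file is closed automatically, and it splits the text into lines and words, which is often the first step in interpreting unstructured text. The sample sentences and the temporary file are assumptions made only so the example runs on its own.

```python
import os
import tempfile

sample = (
    "In this article, I will focus on manipulating RDD in PySpark.\n"
    "Second line of text.\n"
)

# Write the sample to a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "text.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)

    # The with block closes the file automatically, even if an error occurs.
    with open(path, "r", encoding="utf-8") as text_file:
        lines = text_file.read().splitlines()

# Split each line on whitespace to get a rough word count.
word_count = sum(len(line.split()) for line in lines)
print(len(lines), word_count)
```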

3.5 JSON file format

JavaScript Object Notation (JSON) is a text-based open standard designed for exchanging data over the web. The JSON format is used for transmitting structured data over the web. The JSON file format can be easily read in any programming language because it is a language-independent data format.

Let's take an example of a JSON file

The following example shows how a typical JSON file stores information about employees.

{
    "Employee": [
        {
            "id": "1",
            "Name": "Ankit",
            "Sal": "1000"
        },
        {
            "id": "2",
            "Name": "Faizy",
            "Sal": "2000"
        }
    ]
}

Reading a JSON file

Let's load the data from a JSON file. For loading the data you can use the pandas library in Python.

import pandas as pd

df = pd.read_json("/home/kunal/Downloads/Loan_Prediction/train.json")
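For small documents, Python's built-in json module is enough. The sketch below parses the employee example from a string, which stands in for a file here, and also shows that strict JSON does not allow a trailing comma after the last field of an object. The variable names are my own.

```python
import json

raw = """
{
  "Employee": [
    {"id": "1", "Name": "Ankit", "Sal": "1000"},
    {"id": "2", "Name": "Faizy", "Sal": "2000"}
  ]
}
"""

data = json.loads(raw)        # JSON objects become dicts, arrays become lists
employees = data["Employee"]
names = [e["Name"] for e in employees]
total_salary = sum(int(e["Sal"]) for e in employees)  # values were stored as strings
print(names, total_salary)

# A trailing comma makes the document invalid JSON.
try:
    json.loads('{"id": "1",}')
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False
print(trailing_comma_ok)
```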

3.6 XML file format

XML is also known as Extensible Markup Language. As the name suggests, it is a markup language. It has certain rules for encoding data. The XML file format is both human-readable and machine-readable. XML is a self-descriptive language designed for sending information over the internet. XML is very similar to HTML, but has some differences. For example, XML does not use predefined tags as HTML does.

Let's take a simple example of the XML file format.

The following example shows an XML document that contains the information of an employee.

<?xml version="1.0"?>
<contact-info>
  <name>Ankit</name>
  <company>Analytics Vidhya</company>
  <telephone>+9187654321</telephone>
</contact-info>

The "<?xml version="1.0"?>" is an XML declaration at the start of the file (it is optional). In this declaration, version specifies the XML version and encoding, when present, specifies the character encoding used in the document. <contact-info> is a tag in this document. Each XML tag needs to be closed.

Reading XML in python

For reading the data from an XML file you can import the xml.etree.ElementTree library.

Let's import an XML file called train and print its root tag.

import xml.etree.ElementTree as ET

tree = ET.parse('/home/sunilray/Desktop/2 sigma/train.xml')
root = tree.getroot()
print(root.tag)
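The same library can also parse XML held in a string, which makes it easy to experiment with the contact-info example without a file on disk. This is a minimal sketch, and the fields dictionary is simply one convenient way to view the child elements.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<contact-info>
  <name>Ankit</name>
  <company>Analytics Vidhya</company>
  <telephone>+9187654321</telephone>
</contact-info>"""

root = ET.fromstring(document)

# Iterating over an element yields its direct children in document order.
fields = {child.tag: child.text for child in root}

print(root.tag)
print(fields)
```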

3.7 HTML files

HTML stands for Hyper Text Markup Language. It is the standard markup language used for creating web pages. HTML is used to describe the structure of web pages using markup. HTML tags are the same as XML tags, but they are predefined. You can easily identify the subsections of an HTML document on the basis of tags: for example, <head> represents the header of the HTML document and <p> represents a paragraph. HTML tag names are not case sensitive.

The following example shows an HTML document.

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

Each tag in HTML is enclosed in angle brackets (<>). The <!DOCTYPE html> declaration defines that the document is in HTML format. <html> is the root tag of this document. The <head> element contains the header part of this document. The <title>, <body>, <h1> and <p> tags represent the title, body, heading and paragraph respectively in the HTML document.

Reading the HTML file

For reading an HTML file, you can use the BeautifulSoup library. Please refer to this tutorial, which will guide you through parsing HTML documents: Beginner's guide to Web Scraping in Python (using BeautifulSoup)
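If you only need something simple and do not want to install BeautifulSoup, Python's built-in html.parser module can walk the tags of a page. The sketch below is a stand-in for illustration and not the BeautifulSoup approach from the tutorial. The class name TextByTag is hypothetical, and it only handles simple, non-nested tags like those in the example page.

```python
from html.parser import HTMLParser

class TextByTag(HTMLParser):
    """Collect the text found inside a chosen set of tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # the wanted tag we are currently inside, if any
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.found.setdefault(self.current, []).append(data.strip())

page = """<!DOCTYPE html>
<html>
<head><title>Page Title</title></head>
<body><h1>My First Heading</h1><p>My first paragraph.</p></body>
</html>"""

parser = TextByTag(["title", "h1", "p"])
parser.feed(page)
print(parser.found)
```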

3.8 Image files

Image files are probably the most fascinating file format used in data science. Any computer vision application is based on image processing. So it is necessary to know the different image file formats.

Usual image files are 3-dimensional, having RGB values. But they can also be 2-dimensional (grayscale) or 4-dimensional (having intensity). An image consists of pixels and the metadata associated with it.

Each image consists of one or more frames of pixels, and each frame is made up of a two-dimensional array of pixel values. Pixel values can be of any intensity. The metadata associated with an image can be the image type (.png) or the pixel dimensions.

Let's take the example of an image by loading it.

# Note: scipy.misc is only available in older SciPy releases
from scipy import misc
f = misc.face()
misc.imsave('face.png', f)  # uses the Image module (PIL)

import matplotlib.pyplot as plt
plt.imshow(f)
plt.show()

Now, let's check the type of this image and its shape.

type(f) , f.shape
numpy.ndarray,(768, 1024, 3)
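To make the difference between a 3-dimensional (RGB) and a 2-dimensional (grayscale) image concrete, here is a tiny sketch that uses plain nested lists instead of a NumPy array. The 2 x 2 pixel values are made up, the helper names are hypothetical, and the grayscale conversion is a simple channel average rather than the weighted formula that image libraries usually apply.

```python
# A tiny 2 x 2 RGB image as nested lists: rows -> pixels -> (R, G, B) channels.
rgb = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

def shape(image):
    """Return (height, width, channels) for a nested-list RGB image."""
    return (len(image), len(image[0]), len(image[0][0]))

def to_grayscale(image):
    """Average the three channels of every pixel, giving a 2-D image."""
    return [[sum(pixel) // 3 for pixel in row] for row in image]

gray = to_grayscale(rgb)
print(shape(rgb))  # three dimensions: height, width, channels
print(gray)        # two dimensions: one intensity value per pixel
```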

If you want to read about image processing you can refer to this article, which will teach you image processing with an example: Basics of Image Processing in Python

3.9 Hierarchical Data Format (HDF)

In Hierarchical Data Format ( HDF ), you can store a large amount of data easily. It is not only used for storing high volumes of complex data but also for storing small volumes of simple data.

The advantages of using HDF are as mentioned below:

  • It can be used in every size and type of system
  • It has flexible, efficient storage and fast I/O.
  • Many formats support HDF.

There are multiple HDF formats present. But HDF5 is the latest version, which is designed to address some of the limitations of the older HDF file formats. The HDF5 format has some similarity with XML. Like XML, HDF5 files are self-describing and allow users to specify complex data relationships and dependencies.

Let's take the example of an HDF5 file, which can be identified by the .h5 extension.

Read the HDF5 file

You can read the HDF file using pandas. Below is the Python code that loads the train.h5 data into "t".

t = pd.read_hdf('train.h5')

3.10 PDF file format

PDF (Portable Document Format) is an incredibly useful format used for interpretation and display of text documents along with incorporated graphics. A special feature of a PDF file is that it can be secured by a password.

Here's an example of a pdf file.


Reading a PDF file

On the other hand, reading a PDF format through a program is a complex task. There are, however, libraries which do a good job of parsing PDF files; one of them is PDFMiner. To read a PDF file through PDFMiner, you have to:

  • Download PDFMiner and install it through the website
  • Extract PDF file by the following code
pdf2txt.py <pdf_file>.pdf

3.11 DOCX file format

A Microsoft Word docx file is another file format which is regularly used by organizations for text-based data. It has many characteristics, like inline addition of tables, images, hyperlinks, etc., which help in making docx an incredibly important file format.

The advantage of a docx file over a PDF file is that a docx file is editable. You can also change a docx file to any other format.

Here's an example of a docx file:

Reading a docx file

Similar to the PDF format, Python has a community-contributed library to parse a docx file. It is called python-docx2txt.

Installing this library is easy through pip by:

pip install docx2txt

To read a docx file in Python utilise the following code:

import docx2txt

text = docx2txt.process("file.docx")
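As a rough illustration of what such a library has to deal with: a .docx file is a ZIP package whose main text lives in an XML part named word/document.xml. The sketch below builds a stripped-down stand-in archive in memory (not a complete file that Word would open) and pulls the paragraph text out with the standard library. The function name extract_text is hypothetical, and I am not claiming that docx2txt is implemented this way.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, stripped-down stand-in for a .docx: only the main document part.
body_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>World</w:t></w:r></w:p>"
    "</w:body></w:document>" % W
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", body_xml)

def extract_text(file_obj):
    """Join the text runs of each paragraph, one paragraph per line."""
    with zipfile.ZipFile(file_obj) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter("{%s}p" % W):
        paragraphs.append("".join(t.text or "" for t in p.iter("{%s}t" % W)))
    return "\n".join(paragraphs)

text = extract_text(buffer)
print(text)
```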

3.12 MP3 file format

MP3 file format comes under the multimedia file formats. Multimedia file formats are similar to image file formats, but they happen to be one of the most complex file formats.

In multimedia file formats, you can store a variety of data such as text, image, graphical, video and audio data. For example, a multimedia format can allow text to be stored as Rich Text Format (RTF) data rather than ASCII data, which is a plain-text format.

MP3 is one of the most common audio coding formats for digital audio. An mp3 file uses the MPEG-1 (Moving Picture Experts Group - 1) encoding format, which is a standard for lossy compression of video and audio. In lossy compression, once you have compressed the original file, you cannot recover the original data.

The mp3 format compresses audio by filtering out the sound which cannot be heard by humans. MP3 compression usually achieves a 75 to 95% reduction in size, so it saves a lot of space.

mp3 File Format Structure

An mp3 file is made up of several frames. A frame can be further divided into a header and a data block. We call this sequence of frames an elementary stream.

A header in mp3 usually identifies the beginning of a valid frame, and the data blocks contain the (compressed) audio information in terms of frequencies and amplitudes. If you want to know more about the mp3 file structure you can refer to this link.
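To give a feel for what a frame header contains, here is a minimal sketch that decodes the 4-byte header of an MPEG-1 Layer III frame. The function name parse_frame_header and the sample header bytes are my own choices for illustration. The sketch deliberately ignores other MPEG versions and layers, ID3 tags and the remaining header fields, so treat it as a teaching aid rather than a full parser.

```python
# Lookup tables for MPEG-1 Layer III only; other versions and layers use different tables.
# Bitrate index 0 means "free format" and 15 is invalid; both are treated as unsupported here.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]

def parse_frame_header(header):
    """Decode a 4-byte MPEG-1 Layer III frame header into a dict, or return None."""
    if len(header) != 4:
        return None
    value = int.from_bytes(header, "big")
    if (value >> 21) & 0x7FF != 0x7FF:   # the 11 sync bits must all be set
        return None
    if (value >> 19) & 0x3 != 0b11:      # version bits: 11 means MPEG-1
        return None
    if (value >> 17) & 0x3 != 0b01:      # layer bits: 01 means Layer III
        return None
    bitrate = BITRATES_KBPS[(value >> 12) & 0xF]
    sample_rate = SAMPLE_RATES_HZ[(value >> 10) & 0x3]
    padding = (value >> 9) & 0x1
    if bitrate is None or sample_rate is None:
        return None
    # Frame length in bytes, header included.
    frame_length = 144 * bitrate * 1000 // sample_rate + padding
    return {
        "bitrate_kbps": bitrate,
        "sample_rate_hz": sample_rate,
        "frame_length": frame_length,
    }

# Sample header bytes chosen for illustration.
info = parse_frame_header(bytes.fromhex("fffb9000"))
print(info)
```

In a real file you would scan for the sync bits, decode the header, and then jump ahead by frame_length bytes to find the next frame.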

Reading the multimedia files in python

For reading or manipulating multimedia files in Python you can use a library called PyMedia.

3.13 MP4 file format

MP4 file format is used to store videos and movies. It contains multiple images (called frames), which play in the form of a video as per a specific time period. There are two methods for interpreting an mp4 file. One is as a closed entity, in which the whole video is considered as a single entity. The other is as a mosaic of images, where each image in the video is considered as a different entity and these images are sampled from the video.

Here's an example of an mp4 video


Reading an mp4 file

MP4 also has a community-built library for reading and editing mp4 files, called MoviePy.

You can install the library from this link. To read an mp4 video clip in Python, use the following code.

from moviepy.editor import VideoFileClip

clip = VideoFileClip('<video_file>.mp4')

You can then display this in jupyter notebook as below

ipython_display(clip)

End Notes

In this article, I have introduced you to some of the basic file formats which are used by data scientists on a day-to-day basis. There are many file formats I have not covered. The good thing is that I don't need to cover all of them in one article.

I hope you found this article helpful. I would encourage you to explore more file formats. Good luck! If you still have any difficulty in understanding a specific data format, I'd like to interact with you in the comments. If you have any more doubts or queries feel free to drop in your comments below.


Source: https://www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/
