Data Files, Formats, and Types

13.12.2019 |

Episode #4 of the course Data fundamentals by Colby Schrauth and Serge LeBlanc

 

Welcome back!

Yesterday, we learned that there are three main tools that can cover the vast majority of data analysis projects:

• spreadsheets

• databases and SQL

• BI software

Today, we’ll introduce the concepts of data files, formats, and types. Let’s jump in!

 

Data Files

When you hear data file, think file extension. In other words, it’s the way in which a file is saved, which has an impact on how different software applications access and consume the data within it.

For example, Excel files are saved with the extension, .xlsx. The .xlsx data file is Excel’s way of saving data in a particular structure, so the Excel program knows how to read and process the data stored in it. The most common data files are:

• CSV: comma-separated values, unformatted text or numbers delimited by commas

• TSV: tab-separated values, unformatted text or numbers delimited by tabs

• TXT: standard text document that contains an unformatted text document

• DOCX: Microsoft Word-specific extension for a formatted text document

• XLSX: Microsoft Excel-specific extension for a formatted spreadsheet

 

Data Formats

Data formats refer to how the data fields are formatted, or structured, within a file itself. In the previous section, we compared CSV to TSV file formats. Both of these file formats save data in a flat way, meaning there aren’t any nested relationships between data records within the file. For this reason, you’ll hear CSV and TSV files being referred to as flat files, ones that have a fixed structure.

In cases where the data records contain more complex relationships, we need to store the data in a more flexible format that allows for these relationships to be maintained.

For example, what if we wanted to store data for someone’s name and type of car?

Name Car
Serge Ford
Colby Saturn

 

But what if one person has two cars? Would we add a third column (see table below) and leave the other person’s cell blank (an inefficient use of space)? What if someone had five cars?

Name Car Second Car
Serge Ford
Colby Saturn Honda

 

This is where semi-structured/nested data formats come into play, and the popular data format for this type of use case is called JSON (JavaScript Object Notation). JSON is made up of what we call key-value pairs, the names on the left being the keys and the fields on the right being the values, all enveloped by curly braces. For each key, there can be any number of different values, without having to add columns.

The above example in JSON would look like this:

[
{“Name”: “Serge” , “Car”: [“Ford”]},
{“Name”: “Colby” , “Car”: [“Saturn”, “Honda”]}
]

The point here isn’t to be an expert on decoding JSON, but rather to simply know that there are various forms of storing and structuring data. For instance, in addition to JSON, other common formats are Avro, ORC, and Parquet.

 

Data Types

Data types are simply categories assigned by a data tool or programming language that provide information about the range of possibilities for the values that data can have. There are many different data types in the industry, but the three main categories are:

• Text: This is typically referred to as a string and can have a variable (e.g., a comment box on a website) or fixed size (e.g., two-letter country code).

• Numbers: Different types include integer (round numbers), numeric (decimal values), and boolean (0 or 1).

• Dates: These describe dates in time, often stored in a “year-month-day” format.

Data types have a big impact on data tool functionality and performance and can lead to incorrect interpretation of results in certain cases. For instance:

• Financial transaction data saved as integer instead of numeric would round to the nearest dollar.

• A date saved as a timestamp without knowing the timezone would be impossible to convert.

Now that you have an understanding of the different ways we can store data, tomorrow, we’ll take a look at cleaning and organizing data!

—Colby and Serge

 

Recommended book

Confident Data Skills: Master the Fundamentals of Working with Data and Supercharge your Career by Kirill Eremenko

 

Share with friends