
If you’re a Python developer, there’s a good chance you’re familiar with pandas or other DataFrame libraries like Polars. These libraries offer a powerful collection of functionality for cleaning, transforming, and analyzing data. Essentially, they make handling tabular data ergonomic in Python. I often think of the dataframe as “bringing spreadsheets into Python”. There are plenty of good reasons to use DataFrames over spreadsheets, but they don’t solve every problem spreadsheets have. We’ve all experienced the frustration of convoluted and opaque spreadsheet logic, and unfortunately you can find yourself in the same situation using any of these DataFrame libraries. This article uses pandas specifically, but the cautionary advice applies to any similar library.
I’ve been a part of many projects where pandas was used extensively. Typically, these projects start with a problem like the following: there exists some tabular data that needs to be loaded (CSV files, Excel workbooks, and SQL tables are common sources) and then transformed and integrated into some business process.
pandas has extremely convenient functions for loading and transforming data via the dataframe, so it’s an easy choice. Once loaded, the data typically needs a bit of light cleaning – maybe standardizing datetimes, stripping whitespace, or converting the data types of columns. This is what I like to call “small-T” transformation. After that come the “big-T” transformations, which have more to do with the business logic of the application.
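As a rough sketch of what “small-T” cleaning usually looks like (the table and column names here are made up for illustration):

import pandas as pd

def small_t_clean(orders: pd.DataFrame) -> pd.DataFrame:
    """Light, domain-agnostic cleanup: whitespace, datetimes, dtypes."""
    orders["customer_name"] = orders["customer_name"].str.strip()  # strip stray whitespace
    orders["order_date"] = pd.to_datetime(orders["order_date"])    # standardize datetimes
    orders["quantity"] = orders["quantity"].astype("Int64")        # nullable integer dtype
    return orders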
It’s beyond the “small-T” point that I’d argue the utility of the dataframe starts to diminish, or even becomes harmful, and the developer should strongly consider using a domain-specific data structure.
How relying on the dataframe can create problems
I’ve seen many applications where the dataframe becomes the central data structure seen throughout the application, i.e., many functions expect to receive a dataframe and/or return a dataframe. Oftentimes, many of these functions get chained together.
But it’s impossible to know the shape of that dataframe without tracing its transformations through the code it passes through, or inspecting it at a specific point at runtime. As you’re writing new code, it’s easy enough to pass a dataframe that was just modified in one function to another function and remember what to expect, but it’s extremely difficult to remember that shape at a later date. Essentially, the dataframe becomes opaque, and utilizing it as the central data structure of the application logic fails to be expressive.
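To make that concrete, here is a hedged sketch of the kind of chain I mean; the function names and columns are hypothetical, but the pattern should look familiar:

import pandas as pd

# Hypothetical df-in / df-out steps; only their bodies reveal what each one changes.
def enrich_with_region(df: pd.DataFrame) -> pd.DataFrame:
    df["region"] = df["state"].map({"CA": "West"}).fillna("Other")  # quietly adds a column
    return df

def flag_priority_customers(df: pd.DataFrame) -> pd.DataFrame:
    df["priority"] = df["last_name"].isin(["Smith"])                # quietly adds another
    return df

# From the call site alone, the resulting shape is invisible:
# df = flag_priority_customers(enrich_with_region(df))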
There is one obvious exception to my recommendation to move away from the dataframe in the core of the application – analytics. If the central logic of your application is to perform analytics requiring aggregations, groupings, filterings, etc. then the dataframe is still a great choice and pandas has powerful tools for this. But if you don’t consider the purpose of your application logic to be “analytics”, then I would suggest that you take the time to build more expressive data structures for the core of your application.
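For example, if the heart of your application is something like “total order amount by region per month”, the dataframe is exactly the right tool (a rough sketch, with assumed column names):

import pandas as pd

def monthly_totals_by_region(orders: pd.DataFrame) -> pd.DataFrame:
    # Aggregations like this are where the dataframe earns its keep.
    return (
        orders
        .assign(month=orders["order_date"].dt.to_period("M"))
        .groupby(["region", "month"], as_index=False)["amount"]
        .sum()
    )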
The commonly seen problem
Let’s look at an opaque example and then consider a possible alternative that, in my opinion, is worth pursuing because it is more expressive and easier to reason about (especially when you come back to it in the future).
Let’s assume the big task assigned is this: we have a list of contact information in a csv, and we need to create a shipping label for each contact.
Here is some example input. It is already fairly clean, but it offers a couple of opportunities for the “small-T” data cleaning that we often need.
first_name,last_name,address,city,state,zip_code_5,zip_code_4
Bob,Smith ,123 Main St,Anytown,CA,12345 ,
alan j.,johnson,456 Elm St,Anytown,CA,12345,
Carrie,Williams,789 Oak St,Anytown ,ca, 12345 ,6789
Jackie,Smith,,Anytown,CA,12345,N/A
Let’s write some functions to load and clean the data.
import pandas as pd

def load_addresses_file(file_path: str) -> pd.DataFrame:
    return pd.read_csv(file_path)

def clean_addresses(df: pd.DataFrame) -> pd.DataFrame:
    """Replace loaded columns with cleaned versions"""
    df["first_name"] = df["first_name"].str.strip().str.title()
    df["last_name"] = df["last_name"].str.strip().str.title()
    df["address"] = df["address"].str.strip().str.title()
    df["city"] = df["city"].str.strip().str.title()
    df["state"] = df["state"].str.strip().str.upper()
    # use pandas extension type Int64 to handle null values better than int64
    df["zip_code_5"] = df["zip_code_5"].astype("Int64")
    df["zip_code_4"] = df["zip_code_4"].astype("Int64")
    # drop any records that are missing a value in any of the required columns
    df = df.dropna(subset=["first_name", "last_name", "address", "city", "state", "zip_code_5"])
    return df
Now we can use pandas to load a CSV file into a dataframe, and we have a function that performs basic cleaning of the data: stripping whitespace, converting to title case or uppercase, and eliminating records that are incomplete.
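Wiring these together is a one-liner (assuming the input file is named contacts.csv):

df = clean_addresses(load_addresses_file("contacts.csv"))
print(df.head())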
The opaqueness of the dataframe
So let’s get into the details relevant to our application – we need to create a complete shipping label for each contact. The first line will be the first and last name, followed by the home address on the second line, and finally a line for city, state, and zip code. The zip code needs to optionally account for the 4-digit extension, if provided.
def create_text_of_shipping_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Add a "shipping_labels" column to the DataFrame"""
    # Build the zip code part, including zip_code_4 if it exists
    full_zip = df["zip_code_5"].astype(str)
    zip4_exists_mask = pd.notnull(df["zip_code_4"])
    full_zip[zip4_exists_mask] += "-" + df.loc[zip4_exists_mask, "zip_code_4"].astype(str).str.strip()
    df["shipping_labels"] = (
        df["first_name"] + " " + df["last_name"] + "\n"
        + df["address"] + "\n"
        + df["city"] + ", " + df["state"] + " " + full_zip
    )
    return df
This is a fairly common pattern when working with dataframes – create a function that accepts a dataframe and returns a modified dataframe. In this case, we added another column to the dataframe called shipping_labels which contains the complete shipping label for each contact.
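The call site looks innocuous enough; you just have to know (or go look up) which column to read afterwards:

df = create_text_of_shipping_labels(df)
print(df["shipping_labels"].iloc[0])  # you have to already know this column name exists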
You can imagine doing something like this in Excel. Next to your table you’d create a new column and join these values with CONCAT. You would also need an IF to handle the zip code logic. Simple enough, right?
This function on its own is fairly harmless. Subsequent functions that need the shipping label can accept the dataframe and will need to know the name of the column, but that’s not too difficult, right? We can even mention this in a docstring.
My contention, though, is that this kind of function is actually harmful in a larger project. This is because neither the function nor the dataframe it returns is what I consider “expressive”.
From outside this function, it’s not obvious what the most important data in this dataframe is after transformation. The types provided for the function arguments don’t reveal anything either – we passed in a dataframe and we received a dataframe. Was data modified within the dataframe? Was the shape changed with new columns? Were any column types changed? Understanding what happened requires investigation into the function.
Let’s get expressive
So how can we improve this? It’ll take a little extra code, but we can use objects with clearer intent to hold the data if we’re willing to abandon the dataframe when it comes to our core application logic.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Contact:
    first_name: str
    last_name: str
    address: str
    city: str
    state: str
    zip_code_5: str
    zip_code_4: Optional[str] = None

    @property
    def full_zip_code(self) -> str:
        return f"{self.zip_code_5}-{self.zip_code_4}" if self.zip_code_4 else self.zip_code_5

def create_contacts_from_dataframe(df: pd.DataFrame) -> list[Contact]:
    return [
        Contact(
            first_name=row.loc["first_name"],
            last_name=row.loc["last_name"],
            address=row.loc["address"],
            city=row.loc["city"],
            state=row.loc["state"],
            zip_code_5=str(row.loc["zip_code_5"]),
            zip_code_4=str(row.loc["zip_code_4"]) if pd.notnull(row.loc["zip_code_4"]) else None,
        )
        for _, row in df.iterrows()
    ]
The code above transforms the dataframe data we loaded from the CSV into dataclasses. The dataclass clearly defines what we expect a valid contact to look like. The type hints let us know that the 4-digit zip code extension is optional. This data structure is more expressive than a pandas dataframe.
After we have loaded and cleaned the dataframe, we can pass that dataframe into the create_contacts_from_dataframe function to create a list of Contact objects. For all subsequent code, we can use the Contact objects instead of the dataframe, which removes the need to consider trickier pandas concepts in the core application logic. It also makes it much clearer what data we have available to us.
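Putting the pipeline together, the dataframe is confined to loading and cleaning, and everything downstream works with Contact objects:

df = clean_addresses(load_addresses_file("contacts.csv"))
contacts = create_contacts_from_dataframe(df)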
For example, creating the shipping label for each contact could now look like the following, which is straightforward Python syntax:
def create_shipping_label(contact: Contact) -> str:
    """Create a shipping label for a Contact"""
    return (
        contact.first_name + " " + contact.last_name + "\n"
        + contact.address + "\n"
        + contact.city + ", " + contact.state + " " + contact.full_zip_code
    )
The function signature tells us that we’re expecting a Contact. Because of that, we get all the helpful features of an IDE: IntelliSense, plus warnings where the types of the values don’t line up with their usage (such as trying to concatenate an integer with a string).
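Generating every label is then plain Python iteration over plain Python objects:

labels = [create_shipping_label(contact) for contact in contacts]
for label in labels:
    print(label, end="\n\n")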
Is that extra code worth it?
Is the extra code necessary? I would argue yes in the context of a larger project. It’s highly unlikely this will be the only usage of this contact information in your application as it grows. Two months from now, someone will ask “Hey didn’t we ship a bunch of things to those new customers a while back? Can we also load their information into our CRM? The only thing is that we have a full_name field instead of separate first_name and last_name fields. We also only have one field for zipcode so it will need to be combined like you did for the labels.”
Now, as the developer, we’re sweating because, in the “opaque” version, we did put the first and last name together on one line and combine the zip code on another, but all of that logic lives inside a single function that builds one string and stuffs it into a dataframe column. We can’t really reuse that logic.
But in our “expressive” version, instead of creating yet another opaque function that accepts a dataframe and returns a dataframe to perform these steps, we can extend the functionality of our Contact class:
@dataclass
class Contact:
    first_name: str
    last_name: str
    address: str
    city: str
    state: str
    zip_code_5: str
    zip_code_4: Optional[str] = None

    @property
    def full_zip_code(self) -> str:
        return f"{self.zip_code_5}-{self.zip_code_4}" if self.zip_code_4 else self.zip_code_5

    # new code
    @property
    def full_name(self) -> str:
        return self.first_name + " " + self.last_name
And now our Contact objects can be passed around and clearly show that full_name and full_zip_code are available at any time. We can complete this new request with a new function using typical Python syntax:
import json

def get_json_for_crm_contact(contact: Contact) -> str:
    return json.dumps({
        "full_name": contact.full_name,
        "address": contact.address,
        "city": contact.city,
        "state": contact.state,
        "zip_code": contact.full_zip_code,
    })
Looking at the above, someone familiar with pandas might know that pandas has a handy to_json method and be tempted to stay in “dataframe land” so they can use that instead. They’ll need a function that creates the full name column and ignores irrelevant columns before calling to_json, of course, but that’s “not complicated”. Unfortunately, similarly opaque functions will continue to accumulate as the list of features grows, and the overall complexity will increase as business logic gets “locked up” inside of these dataframe transformation functions.
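For contrast, the “stay in dataframe land” version might look something like the sketch below (a hypothetical implementation, not code from earlier in the article); it works, but it’s yet another df-in / df-out function whose effect you can only learn by reading its body:

def create_crm_json_from_dataframe(df: pd.DataFrame) -> str:
    # Hypothetical dataframe-only version: more hidden columns, more hidden shape changes.
    df["full_name"] = df["first_name"] + " " + df["last_name"]
    df["zip_code"] = df["zip_code_5"].astype(str)
    mask = pd.notnull(df["zip_code_4"])
    df.loc[mask, "zip_code"] += "-" + df.loc[mask, "zip_code_4"].astype(str)
    return df[["full_name", "address", "city", "state", "zip_code"]].to_json(orient="records")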
This example may seem trivial and contrived, but I have seen many projects accumulate technical debt because they fail to take the extra steps to build explicit data structures that pertain to their domain, instead relying on dataframes as the central data structure. The solution to this problem can look quite different from what I’ve shown here, but I believe it’s important to try to avoid the problem I’ve presented.
The Swiss-army knife that is pandas may seem like a great choice for your application logic too, but it’s a double-edged sword. The cost will be paid in developer “investigation” time as things evolve. Symptoms will include constantly trying to introspect the state of your dataframes in runtime debugging sessions or reading every line of the call stack to try to follow the dataframe transformations.
I believe it’s worth the extra effort and lines of code to make explicit data structures for information in your domain and limit pandas dataframes to the edges of your application for things like loading, cleaning, and exporting (unless you’re working in the analytics space). Hopefully this example will help you avoid some DataFrame-related headaches in the future.