Feb 2023
I wanted to write a simple example of how one might go from pandas to a more distributed computing framework like pyspark, and do many of the same types of things (transforms, filtering, regex cleaning, etc.). Below I'll walk you through merging and conversions and compare pandas against pyspark using the DataFrame API. There's also the full notebook code here.
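The notebook has the full code, but a minimal sketch of a merge plus a type conversion in the two frameworks looks something like this (the column names and toy data here are just for illustration, not the notebook's):

```python
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# --- pandas ---
left = pd.DataFrame({"id": [1, 2, 3], "col1": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2, 4], "col2": ["10", "20", "40"]})
merged_pd = left.merge(right, on="id", how="inner")          # merge on the shared key
merged_pd["col2"] = merged_pd["col2"].astype(float)          # convert the string column to float

# --- pyspark ---
spark = SparkSession.builder.getOrCreate()
left_sp = spark.createDataFrame(left)
right_sp = spark.createDataFrame(right)
merged_sp = left_sp.join(right_sp, on="id", how="inner")     # same merge, spark calls it a join
merged_sp = merged_sp.withColumn("col2", F.col("col2").cast("double"))  # same conversion
```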
You can see pandas and pyspark are relatively similar but that pyspark is more verbose. This is a pattern you're going to see repeated and I'll explain why as we look through the other examples.
This is a conditional transform: depending on the value of a particular element, one value or another will be output.
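Roughly, that looks like this in the two frameworks (toy data; the column name is just for illustration):

```python
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_pd = pd.DataFrame({"col2": [1, 0, 2, 1]})   # toy stand-in for the notebook's column
df_sp = spark.createDataFrame(df_pd)

# pandas: 1 -> True, anything else -> False
df_pd["col2"] = df_pd["col2"].apply(lambda x: x == 1)

# pyspark: the same rule expressed with F.when / otherwise
df_sp = df_sp.withColumn("col2", F.when(F.col("col2") == 1, True).otherwise(False))
```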
Here, if the value of the element is 1 we transform it to `True`, and if it's anything else it becomes `False`. This is intended to take our numerical column and convert it to boolean values. We could of course cast the column to a bool, but that might produce unintended effects: if some element does not readily cast, it may be set to null. This method makes sure everything in the column is either True or False. Again, notice the longer syntax and the special functions for working with columns in pyspark.
There is more going on here, so follow closely. In the pandas section I've checked whether all the values in the column are numbers in two different ways: first by casting values to numeric and counting them, and second by using a regex. If either check passes, I cast the whole column to numeric. In pyspark the idea is to cast anything that can be cast to float; anything that can't gets filtered out by `df.filter` + `isNotNull()`, and then we count how many values remain. If that number matches the row count of the original dataframe, the whole column can be cast.
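Here's a sketch of that logic with toy data; the regex is an assumption on my part, and apart from `all_are_nums` the variable names aren't the notebook's:

```python
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_pd = pd.DataFrame({"col3": ["1.5", "2", "abc", "4"]})   # toy data: not everything is a number
df_sp = spark.createDataFrame(df_pd)

# --- pandas: two checks, cast the column if either passes ---
numeric_count = pd.to_numeric(df_pd["col3"], errors="coerce").notna().sum()
all_are_nums = df_pd["col3"].astype(str).str.match(r"^-?\d+(\.\d+)?$").all()
if numeric_count == len(df_pd) or all_are_nums:
    df_pd["col3"] = pd.to_numeric(df_pd["col3"])

# --- pyspark: cast to float, drop what becomes null, compare counts ---
castable_count = df_sp.filter(F.col("col3").cast("float").isNotNull()).count()
if castable_count == df_sp.count():
    df_sp = df_sp.withColumn("col3", F.col("col3").cast("float"))
```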
In the pyspark version I only did the counting, not the string regex matching. I just wanted to show some extra logic for how regex works in pandas, since it's not used much in the later examples, so you can ignore the `all_are_nums` variable.
While the syntax differs, you can clearly see that this code does the same thing in both versions. Here the column data actually won't meet our condition (you can see how I generated it in the notebook), so we will move on to the next section...
Here we filter the column data for any character that would be invalid for a number in our dataset. We remove these and then cast what remains to a numeric value.
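A sketch of that cleanup, assuming a regex that keeps only digits, signs, and decimal points (the notebook's exact pattern may differ):

```python
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_pd = pd.DataFrame({"col3": ["$1,200", "35%", " 42 "]})   # toy messy numeric strings
df_sp = spark.createDataFrame(df_pd)

# pandas: drop every character that isn't a digit, sign, or decimal point, then cast
df_pd["col3"] = pd.to_numeric(df_pd["col3"].str.replace(r"[^0-9.\-]", "", regex=True))

# pyspark: the same regex via regexp_replace, then cast what's left to float
df_sp = df_sp.withColumn("col3", F.regexp_replace(F.col("col3"), r"[^0-9.\-]", "").cast("float"))
```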
You can see clearly that the code in both versions replaces any character not matching the regex, and then performs a cast on the remaining data.
Here I wanted to show the difference between a user-defined function in the two frameworks. What you'll see right away is that in pandas you can simply apply a lambda function; no other calls are needed. I've checked for various kinds of null value and preserved them. You have to check for these (or fill them) or the apply will crash.
Likewise, in the pyspark function you can see very similar behavior. I actually used `F.when` unnecessarily here to illustrate some behavior that may be counterintuitive: while `F.when` will produce different output for different column values, it will still pass every column value to the udf. So you still need to check for null values in your udf.
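Here's a sketch of both versions, with a toy `to_int` function standing in for the notebook's udf logic:

```python
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
col4 = "col4"   # illustrative column name
df_pd = pd.DataFrame({col4: ["1", "2", None, "4"]})   # toy column with a null mixed in
df_sp = spark.createDataFrame(df_pd)

def to_int(x):
    # keep nulls as nulls; everything else becomes an int
    if x is None or pd.isna(x):
        return None
    return int(float(x))

# pandas: a plain apply is all you need
df_pd[col4] = df_pd[col4].apply(to_int)

# pyspark: wrap the same function in a udf; the F.when wrapper still sends
# every value (nulls included) through the udf, so the null check stays inside it
int_udf = F.udf(to_int, IntegerType())
df_sp = df_sp.withColumn(col4, F.when(F.col(col4).isNotNull(), int_udf(F.col(col4))))
```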
`df = df.withColumn(col4, int_udf(F.col(col4)))` would actually produce the same output with less complexity, because you simply put your null-handling logic in the udf and apply it to the whole column.
Here the pandas code is quite simple: it looks for a date-like value in the column and converts it to pandas' native datetime format. Pyspark is a little more complex: it uses a regex to match a complex date pattern (so things with / or -), and if it finds something it replaces that with an empty string. You can think of this as a mask the data wears just for the benefit of the `F.when` function; the data is not actually replaced. We then check the result with `isNotNull()` (which returns true for an empty string), and if that check passes we take the original data and cast it to a date.
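Here's a rough sketch of that shape; the date regex and the exact `F.when` condition are my reconstruction from the description above, not the notebook's code:

```python
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
col5 = "col5"   # illustrative column name
df_pd = pd.DataFrame({col5: ["2023-02-01", "2023-02-15", "not a date"]})   # toy data
df_sp = spark.createDataFrame(df_pd)

# pandas: one call detects date-like strings and converts the column
df_pd[col5] = pd.to_datetime(df_pd[col5], errors="coerce")

# pyspark: strip anything matching a date-ish pattern (the "mask"), check the
# result with isNotNull() inside F.when, then cast the original value to a timestamp
date_regex = r"\d{4}[/-]\d{1,2}[/-]\d{1,2}"   # assumed pattern: handles - or / separators
df_sp = df_sp.withColumn(
    col5,
    F.when(
        F.regexp_replace(F.col(col5), date_regex, "").isNotNull(),
        F.col(col5).cast("timestamp"),
    ),
)
```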
If your date pattern is simple you can drop most of this logic and just use `df = df.withColumn(col5, F.to_timestamp(F.col(col5), 'yyyy/MM/dd'))`, but I wanted to show a complex example since real-world data is messy.
The main reason for the additional syntax in pyspark is that the dataset is distributed, so additional interfaces are needed to represent the operations you are doing on it.
This example was simple but your data problems may be complex.
For any of your complex data problems I'm available to hire on contract to help you build and scale whatever data decision engine you need for your business.
Reach out by messaging inquire@automatedinnovations.net