Long/wide-form data transformations

Sun Sep 20 2020•2 min read

Two leading data analytics program, Python and R can do almost the same jobs despite different terms or ways to accomplish them.

The title of this article totally fits into this category and from what I learned about R, gather and spread are two basic tools in data analytics. However, in Python, unless one is going to do some data analysis work, the equivalent arsenals aren't so familiar.

Wide to long-form

This is rather straightforward. Both tools have the same coined function, melt although it's not the only one. Whether we use melt or not, the syntax seems pretty similar in both languages.

# R - Wide to long-form (tidyr::gather)
tidyr::gather(data,key='item_type',value='item_value',-Excepted)

# Python - Wide to long-form (pandas.melt)
pandas.melt(data,id_vars=['Excepted'])

As can be seen above, gather takes names of columns to be index ones with - prefix, and set the resulting column names on-the-fly. On the other hand, Python's melt just takes index columns only.

Long to wide-form

This is where R shines, as it's rather easier than its counterpart. The counterpart of gather from tidyr library, spread does with the simple syntax, while it may not be so simple in Python universe.

# R - Long to wide-form (tidyr::spread)
tidyr::spread(data,item_type,item_value)

However, in Python, the action involves a little follow-ups. It is generally done with pandas's pivot, which is the scaled-down version of its powerful brother, pivot-table, which is capable of aggregations, while pivot isn't.

# Python - Long to wide-form (pandas.pivot)
index_cols = list(data.columns)
index_cols.remove('item_type')
index_cols.remove('item_value')
 
pandas.pivot(data,index=index_cols,columns='item_type',values='item_value').reset_index()

As pivot and pivot_table generates index columns, reset_index() should be followed to make the dataframe to the R-equivalent version.