Using request header in pandas.read_csv() for remote access

1 min read

For Python developers or analysts, it is well known that we can use pandas to access files remotely (i.e., GitHub).

But, one thing which not everyone knows is that we can use it with some custom request header as we access with requests library.

Why?

It may not be clear for someone who only tries to access files from GitHub, which doesn't require the developer to set request header.

But, many sites restrict the access without any header, especially with the User-Agent property.

How?

We are so aware to read a CSV file using pandas like the following.

import pandas as pd
 
data = pd.read_csv('https://website.com/file.csv')

So, what needs to be done to supply request header, then? The storage_options comes to the rescue (added in pandas 1.3.0).

You can pass the header value with this new parameter.

import pandas as pd
 
headers = {'User-Agent':'Some browser info'}
data = pd.read_csv('https://website.com/file.csv', storage_options=headers)

With this, you can access publicly available remote CSV file with pandas. However, it cannot be used to access files in private GitHub repository cannot be accessed (It's more complicated and an article about the topic will be posted later).

CC BY-NC 4.0 © min park.RSS