Wednesday, September 27, 2023

Using DuckDB with Polars. Learn how to use SQL to query your… | by Wei-Meng Lee | Apr, 2023


Photo by Hans-Jurgen Mager on Unsplash

In my last few articles on data analytics, I talked about two important up-and-coming libraries that are currently gaining a lot of traction in the industry:

  • DuckDB: lets you query your dataset in-memory using SQL statements.
  • Polars: a much more efficient DataFrame library compared to the venerable Pandas library.

What about combining the power of these two libraries?

In fact, you can directly query a Polars dataframe through DuckDB, using SQL statements.

So what are the benefits of querying your Polars dataframe using SQL? Despite their ease of use, manipulating Polars dataframes still requires a bit of practice and comes with a relatively steep learning curve. But since most developers are already familiar with SQL, isn’t it more convenient to manipulate the dataframes directly using SQL? Using this approach, developers have the best of both worlds:

  • the ability to query Polars dataframes using all the various functions, or
  • use SQL for cases where it is much more natural and easier to extract the data that they want

In this article, I will give you some examples of how you can make use of SQL through DuckDB to query your Polars dataframes.

For this article, I’m using Jupyter Notebook. Ensure that you have installed Polars and DuckDB using the following commands:

!pip install polars
!pip install duckdb

To get started, let’s create a Polars DataFrame by hand:

import polars as pl

df = pl.DataFrame(
    {
        'Model': ['iPhone X','iPhone XS','iPhone 12',
                  'iPhone 13','Samsung S11',
                  'Samsung S12','Mi A1','Mi A2'],
        'Sales': [80,170,130,205,400,30,14,8],
        'Company': ['Apple','Apple','Apple','Apple',
                    'Samsung','Samsung','Xiao Mi',
                    'Xiao Mi'],
    })
df

Here’s how the dataframe looks:

All images by author

Say you now want to find all phones from Apple that have sales of more than 80. You can use the filter() function in Polars, like this:

df.filter(
    (pl.col('Company') == 'Apple') &
    (pl.col('Sales') > 80)
)

And the result looks like this:

Let’s now do the exact query that we did in the previous section, except that this time round we will use DuckDB with a SQL statement. But first, let’s select all the rows in the dataframe:

import duckdb

result = duckdb.sql('SELECT * FROM df')
result

You can directly reference the df dataframe from your SQL statement.

Using DuckDB, you issue a SQL statement using the sql() function. Alternatively, the query() function also works:

result = duckdb.query('SELECT * FROM df')

The result variable is a duckdb.DuckDBPyRelation object. Using this object, you can perform quite a number of different tasks, such as:

  • Getting the mean of the Sales column:
result.mean('Sales')
  • Describing the dataframe:
result.describe()
  • Applying a function (here, max) to columns in the dataframe:
result.apply("max", 'Sales,Company')
  • Reordering the dataframe:
result.order('Sales DESC')

But the easiest way to query the Polars DataFrame is to use SQL directly.

For example, if you want to get all the rows with sales greater than 80, simply use the sql() function with the SQL statement below:

duckdb.sql('SELECT * FROM df WHERE Sales > 80').pl()

The pl() function converts the duckdb.DuckDBPyRelation object to a Polars DataFrame. If you want to convert it to a Pandas DataFrame instead, use the df() function.

If you wish to get all of the rows whose mannequin title begins with “iPhone”, then use the next SQL assertion:

duckdb.sql("SELECT * FROM df WHERE Mannequin LIKE 'iPhone%'").pl()

If you want all devices from Apple and Xiao Mi, then use the following SQL statement:

duckdb.sql("SELECT * FROM df WHERE Company = 'Apple' OR Company = 'Xiao Mi'").pl()

The real power of using DuckDB with Polars DataFrames shows when you want to query across multiple dataframes. Consider the following three CSV files from the 2015 Flights Delay dataset:

2015 Flights Delay dataset: https://www.kaggle.com/datasets/usdot/flight-delays (License: CC0, Public Domain)

  • flights.csv
  • airlines.csv
  • airports.csv

Let’s load them up using Polars:

import polars as pl

df_flights = pl.scan_csv('flights.csv')
df_airlines = pl.scan_csv('airlines.csv')
df_airports = pl.scan_csv('airports.csv')

display(df_flights.collect().head())
display(df_airlines.collect().head())
display(df_airports.collect().head())

The above statements use lazy evaluation to load up the three CSV files. With lazy evaluation, queries on the dataframes are not executed immediately; Polars first builds and optimizes a query plan. The collect() function forces Polars to execute that plan and load the CSV files into dataframes.

Here is how the df_flights, df_airlines, and df_airports dataframes look:

Suppose you want to count the number of times an airline has a delay, and at the same time display the name of each airline. Here is the SQL statement that you can use with the df_airlines and df_flights dataframes:

duckdb.sql('''
    SELECT
        count(df_airlines.AIRLINE) as Count,
        df_airlines.AIRLINE
    FROM df_flights, df_airlines
    WHERE df_airlines.IATA_CODE = df_flights.AIRLINE AND df_flights.ARRIVAL_DELAY > 0
    GROUP BY df_airlines.AIRLINE
    ORDER BY Count DESC
''')

And here is the result:

If you want to count the number of airports in each state and sort the count in descending order, you can use the following SQL statement:

duckdb.sql('''
    SELECT STATE, Count(*) as AIRPORT_COUNT
    FROM df_airports
    GROUP BY STATE
    ORDER BY AIRPORT_COUNT DESC
''')

Finally, suppose you want to know which airline has the highest average delay. You can use the following SQL statement to calculate the various statistics, such as minimum arrival delay, maximum arrival delay, mean arrival delay, and standard deviation of arrival delay:

duckdb.sql('''
SELECT AIRLINE, MIN(ARRIVAL_DELAY), MAX(ARRIVAL_DELAY),
MEAN(ARRIVAL_DELAY), stddev(ARRIVAL_DELAY)
FROM df_flights
GROUP BY AIRLINE
ORDER BY MEAN(ARRIVAL_DELAY)
''')

Based on the mean arrival delay, we can see that the AS airline is the one with the shortest delay (as the value is negative, this means most of the time it arrives earlier!) and the NK airline is the one with the longest delay. Want to know what the AS airline is? Try it out using what you have just learned! I’ll leave it as an exercise, and the answer is at the end of this article.


In this short article, I illustrated how DuckDB and Polars can be used together to query your dataframes. Using both libraries gives you the best of both worlds: a familiar query language (SQL) for querying an efficient dataframe. Go ahead and try it out using your own dataset, and share with us how it has helped your data analytics processes.

Answer to quiz:

duckdb.sql("SELECT AIRLINE from df_airlines WHERE IATA_CODE = 'AS'")