Polars: The Next Big Python Data Science Library... written in RUST?

Shared on 2022-12-29
In this video tutorial I explain everything you need to get started coding with Polars. Polars is a multi-threaded DataFrame library, meaning it can use all the cores of a computer at the same time to reach its full processing potential. It's been shown to have huge performance gains over pandas.

Timeline:
00:00 Intro
01:00 What is Polars?
02:43 Getting Started
06:32 Filtering
07:15 New Columns
08:10 Groupby
08:55 Combining Dataframes
10:17 Multithreaded Approach
11:21 Speed Test
12:50 Takeaways

Follow me on twitch for live coding streams: www.twitch.tv/medallionstallion_

My other videos:

Speed Up Your Pandas Code: Make Your Pandas Code Lightning Fast
Intro to Pandas video: A Gentle Introduction to Pandas Data ...
Exploratory Data Analysis Video: Exploratory Data Analysis with Pandas...

Working with Audio data in Python: Audio Data Processing in Python
Efficient Pandas Dataframes: Speed Up Your Pandas Dataframes

* Youtube: youtube.com/@robmulla?sub_confirmation=1
* Discord: discord.gg/HZszek7DQc
* Twitch: www.twitch.tv/medallionstallion_
* Twitter: twitter.com/Rob_Mulla
* Kaggle: www.kaggle.com/robikscube

#python #polars #datascience

Comments (21)
  • Polars is built on top of Apache Arrow which pandas supports. So you can easily convert your polars dataframe to pandas with almost zero overhead. I use polars to do the hard part and jump back to pandas for the visualization stuff
  • 10000 points for printing the version. Every tutorial video should do that.
  • @brd5548
    Our team tried to integrate Polars into our analytics pipeline last year, and the result was kind of hit-and-miss. To be honest, the performance of pandas is not that bad: we spent some time fine-tuning, such as rewriting key bottlenecks with our own native modules or with vectorized pandas methods, and the result turned out just OK. On the other hand, integrating Polars required some major revamping and refactoring, due to API gaps and implementation differences between the two, and the performance gains didn't seem to justify the effort. What's worse, while pandas does come with pitfalls and caveats here and there, Polars is a relatively young project and it came with bugs in basic text-manipulation operations. But don't get me wrong, that was my experience last year. I do think Polars has potential. It has a much more robust and modern architecture than pandas, in my opinion. Its API style is cleaner and more consistent. And it comes with a query optimization engine, which many users will appreciate if they are familiar with tools like Apache Spark or some databases. Given time, I think Polars should become another powerful player. So definitely give it a try if you're building something new!
  • 13:20 Regarding learning the syntax… It’s worth mentioning that Polars syntax is very similar to PySpark, so it’s really two birds with one stone.
  • Nice video. Very interesting to see how Polars works. I hope to see it more frequently in your future streams to learn more about its practical use.
  • Great timing, I was looking to start playing with Polars since Mark Tenenholtz mentioned it some days ago. I went back to pandas because I couldn't find the assign() and astype() equivalents in Polars. I thought they were missing, but they seem to be with_columns() and cast(). Now I will pick it up again and be more persistent.
  • I saw some tweets about Polars, but seeing it in action is something else. Also, I can't believe it took me this long to find your channel, subbed!
  • Thanks for a good explanation of how Polars could benefit people who use pandas and need more speed. In my project we already have a heavy emphasis on multiprocessing and fast inter-process communication, so I am especially interested to see a pandas vs Polars single-core performance comparison for group and join. I hope that someone does the comparison and posts it to YouTube.
  • @jcbritobr
    Nice stuff. Polars seems like a killer tool. Thank you for sharing.
  • Thanks for bringing this to my attention. I think I might include Polars in some productionalization processes. For data exploration, I typically only use parts of dataframes for plotting or investigation. Given that you can convert a Polars dataframe to pandas, it seems like a good approach would be to keep the full dataset in Polars and then filter into a pandas dataframe and plot.
  • @juan.o.p.
    Thanks for the recommendation, I will definitely give it a try 😊
  • @tmb8807
    I'm blown away by how fast this is. Sure there are some things it can't do, but man, even for just reading large data sets it's absolutely blazing.
  • @GiasoneP
    Like PySpark AND pandas. The second half mirrors PySpark. Given the speed and out-of-the-box parallelization, I wonder how it stacks up against Spark and how its functionality compares to a cluster of machines. Take AWS for example: can it be applied to an EMR cluster? As a side note, I'm super excited about Rust and its future in data.
  • Hi @rub, I think it's a good approach to diversify our tools these days, especially when it comes to dealing with memory (sometimes I find myself running out of time with pandas)
  • DataTable is also pretty legendary, you might also find it super awesome. Thanks again for your amazing videos, I have watched and learned from every one of them. I hope I'll interview you about your 100k celebration sometime next year 🙏
  • I love your work. You should have a course on data science for folks like us who are just learning.
  • @rackstar2
    I recently decided to fully transition over to using Polars instead of pandas for a data pipeline project. The primary reason I'm liking Polars over pandas is not just the speed (the speed is nice, don't get me wrong) but the space usage! Almost all of my operations entailed working with data larger than memory. One of the operations I have to do is pivoting a dataframe. My end result has thousands of columns! My kernel never seems to hold steady when doing this with pandas, but Polars is really doing the trick for me. One small problem I did face, though, is exporting the results of the pipeline. I still have to resort to something like pyarrow and use its writer to do the export in chunks. This might just be because of how low my system memory is. Regardless, Polars seems to be an excellent option for data processing and manipulation, and if you do want to showcase your data, you can always convert back and forth with pandas!