{"id":329587,"date":"2024-05-06T13:57:46","date_gmt":"2024-05-06T13:57:46","guid":{"rendered":"https:\/\/www.blog.pythonlibrary.org\/?p=12313"},"modified":"2024-05-06T13:57:46","modified_gmt":"2024-05-06T13:57:46","slug":"how-to-read-and-write-parquet-files-with-python","status":"publish","type":"post","link":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/2024\/05\/06\/how-to-read-and-write-parquet-files-with-python\/","title":{"rendered":"How to Read and Write Parquet Files with Python"},"content":{"rendered":"<p>Apache Parquet is a popular columnar storage format used by data scientists and anyone working in the Hadoop ecosystem. It was developed to be very efficient in terms of compression and encoding. 
Check out the Parquet <a href=\"https:\/\/parquet.apache.org\/\">documentation<\/a> if you want to know all the details about how Parquet files work.<\/p>\n<p>You can read and write Parquet files with Python using the <a href=\"https:\/\/arrow.apache.org\/docs\/python\/index.html\">pyarrow package<\/a>.<\/p>\n<p>Let&#8217;s learn how that works now!<\/p>\n<h2>Installing pyarrow<\/h2>\n<p>The first step is to make sure you have everything you need. In addition to the Python programming language, you will also need the <strong>pyarrow<\/strong> and <strong>pandas<\/strong> packages. You will use pandas because it is another Python package that uses columns as a data format and works well with Parquet files.<\/p>\n<p>You can use pip to install both of these packages. Open up your terminal and run the following command:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">python -m pip install pyarrow pandas<\/pre>\n<p>If you use Anaconda, you&#8217;ll want to install pyarrow using this command instead:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">conda install -c conda-forge pyarrow<\/pre>\n<p>Anaconda should already include pandas, but if not, you can use the same command, replacing pyarrow with pandas.<\/p>\n<p>Now that you have pyarrow and pandas installed, you can use them to read and write Parquet files!<\/p>\n<h2>Writing Parquet Files with Python<\/h2>\n<p>Writing Parquet files with Python is pretty straightforward. The code to turn a pandas DataFrame into a Parquet file is about ten lines.<\/p>\n<p>Open up your favorite Python IDE or text editor and create a new file. You can name it something like <code>parquet_file_writer.py<\/code> or use some other descriptive name. 
Then enter the following code:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import pandas as pd\r\nimport pyarrow as pa\r\nimport pyarrow.parquet as pq\r\n\r\n\r\ndef write_parquet(df: pd.DataFrame, filename: str) -&gt; None:\r\n    # Convert the DataFrame to a pyarrow Table, then write it to disk\r\n    table = pa.Table.from_pandas(df)\r\n    pq.write_table(table, filename)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    data = {\"Languages\": [\"Python\", \"Ruby\", \"C++\"],\r\n            \"Users\": [10000, 5000, 8000],\r\n            \"Dynamic\": [True, True, False],\r\n            }\r\n    # Use a custom index (1, 2, 3) instead of the default 0-based one\r\n    df = pd.DataFrame(data=data, index=list(range(1, 4)))\r\n    write_parquet(df, \"languages.parquet\")<\/pre>\n<p>For this example, you have three imports:<\/p>\n<ul>\n<li>One for <code>pandas<\/code>, so you can create a <code>DataFrame<\/code><\/li>\n<li>One for <code>pyarrow<\/code>, to create a special <code>pyarrow.Table<\/code> object<\/li>\n<li>One for <code>pyarrow.parquet<\/code> to transform the table object into a Parquet file<\/li>\n<\/ul>\n<p>The <strong>write_parquet()<\/strong> function takes in a pandas DataFrame and the file name or path to save the Parquet file to. It first converts the DataFrame into a pyarrow Table object and then passes that table to the <code>write_table()<\/code> function, which writes it to disk as a Parquet file.<\/p>\n<p>Now you are ready to read that file you just created!<\/p>\n<h2>Reading Parquet Files with Python<\/h2>\n<p>Reading the Parquet file you created earlier with Python is even easier. 
You&#8217;ll need about half as many lines of code!<\/p>\n<p>You can put the following code into a new file called something like <code>parquet_file_reader.py<\/code> if you want to:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import pyarrow.parquet as pq\r\n\r\ndef read_parquet(filename: str) -&gt; None:\r\n    table = pq.read_table(filename)\r\n    df = table.to_pandas()\r\n    print(df)\r\n\r\nif __name__ == \"__main__\":\r\n    read_parquet(\"languages.parquet\")<\/pre>\n<p>In this example, you read the Parquet file into a pyarrow Table format and then convert it to a pandas DataFrame using the Table&#8217;s <strong>to_pandas()<\/strong> method.<\/p>\n<p>When you print out the contents of the DataFrame, you will see the following:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">  Languages  Users  Dynamic\r\n1    Python  10000     True\r\n2      Ruby   5000     True\r\n3       C++   8000    False<\/pre>\n<p>You can see from the output above that the DataFrame contains all the data you saved.<\/p>\n<p>One of the strengths of using a Parquet file is that you can read just parts of the file instead of the whole thing. For example, you can read in just some of the columns rather than the whole file!<\/p>\n<p>Here&#8217;s an example of how that works:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import pyarrow.parquet as pq\r\n\r\ndef read_columns(filename: str, columns: list[str]) -&gt; None:\r\n    table = pq.read_table(filename, columns=columns)\r\n    print(table)\r\n\r\nif __name__ == \"__main__\":\r\n    read_columns(\"languages.parquet\", columns=[\"Languages\", \"Users\"])<\/pre>\n<p>To read in just the &#8220;Languages&#8221; and &#8220;Users&#8221; columns from the Parquet file, you pass in a list that contains just those column names. 
Then when you call\u00a0<strong>read_table()<\/strong> you pass in the columns you want to read.<\/p>\n<p>Here&#8217;s the output when you run this code:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">pyarrow.Table\r\nLanguages: string\r\nUsers: int64\r\n----\r\nLanguages: [[\"Python\",\"Ruby\",\"C++\"]]\r\nUsers: [[10000,5000,8000]]\r\n<\/pre>\n<p>This outputs the pyarrow Table format, which differs slightly from a pandas DataFrame. It tells you information about the different columns; for example, Languages are strings, and Users are of type int64.<\/p>\n<p>If you prefer to work only with pandas DataFrames, the pyarrow package allows that too. As long as you know the Parquet file contains pandas DataFrames, you can use\u00a0<strong>read_pandas()<\/strong> instead of\u00a0<strong>read_table().<\/strong><\/p>\n<p>Here&#8217;s a code example:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import pyarrow.parquet as pq\r\n\r\ndef read_columns_pandas(filename: str, columns: list[str]) -&gt; None:\r\n    table = pq.read_pandas(filename, columns=columns)\r\n    df = table.to_pandas()\r\n    print(df)\r\n\r\nif __name__ == \"__main__\":\r\n    read_columns_pandas(\"languages.parquet\", columns=[\"Languages\", \"Users\"])<\/pre>\n<p>When you run this example, the output is a DataFrame that contains just the columns you asked for:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">  Languages  Users\r\n1    Python  10000\r\n2      Ruby   5000\r\n3       C++   8000<\/pre>\n<p>One advantage of using the read_pandas() and to_pandas() methods is that they will maintain any additional index column data in the DataFrame,\u00a0while the pyarrow Table may not.<\/p>\n<h2>Reading Parquet File Metadata<\/h2>\n<p>You can also get the metadata from a Parquet file using Python. 
Getting the metadata can be useful when you need to inspect an unfamiliar Parquet file to see what type(s) of data it contains.<\/p>\n<p>Here&#8217;s a small code snippet that will read the Parquet file&#8217;s metadata and schema:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import pyarrow.parquet as pq\r\n\r\ndef read_metadata(filename: str) -&gt; None:\r\n    parquet_file = pq.ParquetFile(filename)\r\n    metadata = parquet_file.metadata\r\n    print(metadata)\r\n    print(f\"Parquet file: {filename} Schema\")\r\n    print(parquet_file.schema)\r\n\r\nif __name__ == \"__main__\":\r\n    read_metadata(\"languages.parquet\")<\/pre>\n<p>There are two ways to get the Parquet file&#8217;s metadata:<\/p>\n<ul>\n<li>Use <strong>pq.ParquetFile<\/strong> to read the file and then access the <strong>metadata<\/strong> property<\/li>\n<li>Use <strong>pq.read_metadata(filename)<\/strong> instead<\/li>\n<\/ul>\n<p>The benefit of the former method is that you can also access the <strong>schema<\/strong> property of the <strong>ParquetFile<\/strong> object.<\/p>\n<p>When you run this code, you will see this output:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">&lt;pyarrow._parquet.FileMetaData object at 0x000002312C1355D0&gt;\r\n  created_by: parquet-cpp-arrow version 15.0.2\r\n  num_columns: 4\r\n  num_rows: 3\r\n  num_row_groups: 1\r\n  format_version: 2.6\r\n  serialized_size: 2682\r\nParquet file: languages.parquet Schema\r\n&lt;pyarrow._parquet.ParquetSchema object at 0x000002312BBFDF00&gt;\r\nrequired group field_id=-1 schema {\r\n  optional binary field_id=-1 Languages (String);\r\n  optional int64 field_id=-1 Users;\r\n  optional boolean field_id=-1 Dynamic;\r\n  optional int64 field_id=-1 __index_level_0__;\r\n}<\/pre>\n<p>Nice! You can read the output above to learn the number of rows and columns of data and the serialized size of the metadata. The column count is four rather than three because pandas stored the DataFrame index as the extra <code>__index_level_0__<\/code> column. 
The schema tells you what the field types are.<\/p>\n<h2>Wrapping Up<\/h2>\n<p>Parquet files are becoming more popular in big data and data science-related fields. Python&#8217;s pyarrow package makes working with Parquet files easy. You should spend some time experimenting with the code in this tutorial and using it for some of your own Parquet files.<\/p>\n<p>When you want to learn more, check out the <a href=\"https:\/\/arrow.apache.org\/docs\/python\/install.html\">Parquet documentation<\/a>.<\/p>\n<p>The post <a href=\"https:\/\/www.blog.pythonlibrary.org\/2024\/05\/06\/how-to-read-and-write-parquet-files-with-python\/\">How to Read and Write Parquet Files with Python<\/a> appeared first on <a href=\"https:\/\/www.blog.pythonlibrary.org\/\">Mouse Vs Python<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<div class=\"mh-excerpt\"><p>Apache Parquet files are a popular columnar storage format used by data scientists and anyone using the Hadoop ecosystem. It was developed to be very efficient in terms of compression and encoding. Check out their documentation if you want to know all the details about how Parquet files work. You can read and write Parquet [\u2026]<\/p>\n<p>The post <a href=\"https:\/\/www.blog.pythonlibrary.org\/2024\/05\/06\/how-to-read-and-write-parquet-files-with-python\/\">How to Read and Write Parquet Files with Python<\/a> appeared first on <a href=\"https:\/\/www.blog.pythonlibrary.org\/\">Mouse Vs Python<\/a>.<\/p>\n<\/div>","protected":false},"author":2018,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"slim_seo":{"title":"How to Read and Write Parquet Files with Python - ITTeacherITFreelance.hk","description":"Apache Parquet files are a popular columnar storage format used by data scientists and anyone using the Hadoop ecosystem. 
It was developed to be very efficient"},"footnotes":""},"categories":[10700],"tags":[],"_links":{"self":[{"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/329587"}],"collection":[{"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/users\/2018"}],"replies":[{"embeddable":true,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/comments?post=329587"}],"version-history":[{"count":1,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/329587\/revisions"}],"predecessor-version":[{"id":329588,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/329587\/revisions\/329588"}],"wp:attachment":[{"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/media?parent=329587"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/categories?post=329587"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/tags?post=329587"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}