This script converts data files from one format to another.
This tool accepts comma-separated value files (.csv) as well as Apache Parquet (.parquet) files. The first row of a CSV file is assumed to contain the column names.
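This script requires that pandas, pyarrow, argparse and pathlib be available within the Python environment you are running this script in (argparse and pathlib are part of the Python standard library; pandas and pyarrow must be installed separately).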
This file can also be imported as a module and contains the following functions:
- csv_to_parquet: converts CSV to Parquet and saves to file
- parquet_to_csv: converts Parquet to CSV and saves to file
- parquet_schema: returns the schema of a Parquet file
- add_filename_suffix: returns the filename string with a suffix added to the name and the extension changed
- is_file_ext_correct: returns True if the filename has the correct file extension and prints a message otherwise
- print_success_message: prints a message about the successful conversion with the elapsed time
- construct_argument_parser: constructs the argument parser
- main: the main function of the script
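The two filename helpers are described above but their code is not shown here, so the following is only a minimal sketch of how they might behave, based on those descriptions and on the file names used in the examples (data.csv becoming data_converted.parquet). The parameter names and defaults are assumptions, not the script's actual signatures.

```python
from pathlib import Path


def add_filename_suffix(filename, suffix="_converted", new_ext=None):
    """Return filename with suffix appended to its stem and, optionally, a new extension.

    Hypothetical signature: the real function may take different arguments.
    """
    path = Path(filename)
    # Keep the original extension unless a new one is requested.
    ext = new_ext if new_ext is not None else path.suffix
    return str(path.with_name(path.stem + suffix + ext))


def is_file_ext_correct(filename, expected_ext):
    """Return True if filename ends with expected_ext; print a message otherwise."""
    if Path(filename).suffix.lower() == expected_ext.lower():
        return True
    print(f"Expected a {expected_ext} file, got: {filename}")
    return False
```

With these assumptions, add_filename_suffix("data.csv", "_converted", ".parquet") yields "data_converted.parquet", which matches the default output name used in the examples.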
The script accepts the following command-line options:
-h, --help             show help message and exit
-cp, --csv2parquet     convert csv to parquet. Set input csv filename string (example: data.csv)
-pc, --parquet2csv     convert parquet to csv. Set input parquet filename string (example: data.parquet)
-s, --get_schema       get schema of parquet file. Set input parquet filename string (example: data.parquet)
-o, --output           set output file name without extension (example: newfile)
-d, --delimiter        set delimiter for csv file (default: ,)
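For readers importing the module, a parser with the options listed above could be built with argparse roughly as follows. This is a sketch under assumptions: the option strings, help texts and the "," default come from the list above, while the description text and the choice to store each value as a plain string are guesses about the real construct_argument_parser.

```python
import argparse


def construct_argument_parser():
    """Build a parser mirroring the documented options (sketch, not the original code)."""
    parser = argparse.ArgumentParser(
        description="Convert between CSV and Parquet files."  # assumed wording
    )
    parser.add_argument("-cp", "--csv2parquet",
                        help="convert csv to parquet. Set input csv filename string")
    parser.add_argument("-pc", "--parquet2csv",
                        help="convert parquet to csv. Set input parquet filename string")
    parser.add_argument("-s", "--get_schema",
                        help="get schema of parquet file. Set input parquet filename string")
    parser.add_argument("-o", "--output",
                        help="set output file name without extension")
    parser.add_argument("-d", "--delimiter", default=",",
                        help="set delimiter for csv file (default: ,)")
    return parser
```

Calling construct_argument_parser().parse_args(["-cp", "data.csv"]) returns a namespace whose csv2parquet attribute is "data.csv" and whose delimiter attribute falls back to ",".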
Assume that you have some data in your data.csv file:
id,first_name,second_name,age
0,Vitaliy,Povstenko,19
1,John,Doe,25
2,Bill,Gates,40
3,Elon,Musk,30
4,Don,Joel,25
To convert data.csv to Parquet, run the following command:
$ python convertor.py --csv2parquet data.csv
This creates a new file, data_converted.parquet, in your directory.
You can use the --parquet2csv parameter to convert the data_converted.parquet file back to CSV:
$ python convertor.py --parquet2csv data_converted.parquet
A new file, data_converted_converted.csv, is now added to your directory, but it is better to specify the output file name (without extension) with the --output parameter:
$ python convertor.py --parquet2csv data_converted.parquet --output newfile
Successfully converted from data_converted.parquet to newfile.csv.
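If you need to save or read CSV files using a special delimiter, pass it with the --delimiter parameter (quote the semicolon so that the shell does not treat it as a command separator):
$ python convertor.py -pc data_converted.parquet -o newfile --delimiter ";"
Now the file newfile.csv is delimited by ;: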
id;first_name;second_name;age
0;Vitaliy;Povstenko;19
1;John;Doe;25
2;Bill;Gates;40
3;Elon;Musk;30
4;Don;Joel;25
The script allows you to convert from CSV to JSON in a similar way:
$ python convertor.py --csv2json data.csv
The resulting data_converted.json looks like:
[{"id": "0", "first_name": "Vitaliy", "second_name": "Povstenko", "age": "19"}...]
By default, the script saves the JSON file without indentation, but you can specify an indent for the new JSON file:
$ python convertor.py -cj data.csv -o names --json_indent 4
The new names.json contains:
[
    {
        "id": "0",
        "first_name": "Vitaliy",
        "second_name": "Povstenko",
        "age": "19"
    }
    ...
]
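The JSON output above keeps every value as a string ("id": "0"), which is what Python's csv.DictReader produces. How the script actually performs this conversion is not shown, so the following is only a sketch using the standard library; the function name csv_to_json and its parameters are hypothetical.

```python
import csv
import json


def csv_to_json(csv_path, json_path, delimiter=",", indent=None):
    """Write the rows of a CSV file as a JSON array of objects (hypothetical helper)."""
    with open(csv_path, newline="", encoding="utf-8") as src:
        # DictReader uses the first row as keys and leaves all values as strings.
        rows = list(csv.DictReader(src, delimiter=delimiter))
    with open(json_path, "w", encoding="utf-8") as dst:
        # indent=None writes everything on one line; an integer pretty-prints.
        json.dump(rows, dst, indent=indent)
    return rows
```

Passing indent=4 corresponds to the --json_indent 4 example, while the default of None gives the compact single-line form.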
There are some cases when you need to know the schema of your Parquet file. For example:
$ python convertor.py --get_schema data_converted.parquet
The script produces the following output:
id: int64
first_name: string
second_name: string
age: int64
Apache 2.0 License: www.apache.org/licenses/LICENSE-2.0 or see the LICENSE file.