This repository contains scripts that download and process datasets from USDA FoodData Central (FDC) for easy analysis and integration into other projects.

Preprocessing:
- The preprocessing pipeline consists of three scripts, each handling a specific data type (Foundation Foods, SR Legacy Foods, and Branded Foods).
- Within each script:
  - Data is downloaded and read in from URLs gathered in main.py.
  - Only relevant dataframes and columns are kept, unless the --keep_files flag is specified in arguments.
  - Data is cleaned, merged, aggregated, supplemented, and saved as an intermediary Parquet file.
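
The per-script steps can be sketched with pandas. This is a minimal illustration under stated assumptions, not the repository's code: the helper `preprocess`, the input tables, and the column names `description`, `nutrient_name`, and `amount` are hypothetical, and the inputs are built inline instead of being downloaded from FDC.

```python
import pandas as pd


def preprocess(food: pd.DataFrame, food_nutrient: pd.DataFrame) -> pd.DataFrame:
    """Keep relevant columns, aggregate nutrients, and merge to one row per food."""
    # Keep only the columns needed downstream (hypothetical selection).
    food = food[["fdc_id", "description"]]
    food_nutrient = food_nutrient[["fdc_id", "nutrient_name", "amount"]]

    # Aggregate to one column per nutrient and one row per fdc_id.
    wide = food_nutrient.pivot_table(
        index="fdc_id", columns="nutrient_name", values="amount", aggfunc="mean"
    ).reset_index()
    wide.columns.name = None

    # Merge the food descriptions with the nutrient columns.
    return food.merge(wide, on="fdc_id", how="left")


food = pd.DataFrame(
    {"fdc_id": [1, 2], "description": ["Apple", "Milk"], "unused": [0, 0]}
)
food_nutrient = pd.DataFrame(
    {
        "fdc_id": [1, 1, 2],
        "nutrient_name": ["protein", "energy", "protein"],
        "amount": [0.3, 52.0, 3.4],
    }
)
result = preprocess(food, food_nutrient)

# Saving the intermediary file requires pyarrow (the path is a placeholder):
# result.to_parquet("foundation_food.parquet", index=False)
```

Foods that lack a given nutrient end up with NaN in that column, which is left for the postprocessing stage to handle.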

Data Stacking:
- Upon completion of individual processing, the intermediary Parquet files are read into main.py.
- The data is stacked together.
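
Stacking frames whose columns only partly overlap could look like the sketch below. It assumes `pd.concat` is used (the repository may do this differently), and the three small frames stand in for the intermediary Parquet files, which would normally be loaded with `pd.read_parquet`.

```python
import pandas as pd

# Stand-ins for the three intermediary Parquet files.
foundation = pd.DataFrame(
    {"fdc_id": [1], "data_type": ["foundation"], "food_common_name": ["apple"]}
)
sr_legacy = pd.DataFrame({"fdc_id": [2], "data_type": ["sr_legacy"]})
branded = pd.DataFrame(
    {"fdc_id": [3], "data_type": ["branded"], "brand_name": ["Acme"]}
)

# Stack the rows; columns missing from a source are filled with NaN.
stacked = pd.concat([foundation, sr_legacy, branded], ignore_index=True, sort=False)
```

This is why source-specific columns such as `brand_name` are empty for rows from the other data types.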

Postprocessing:
- The concatenated datasets are then postprocessed and finalized by:
  - Resetting the indices
  - Setting data types
  - Filling missing values
  - Cleaning string data
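
The four postprocessing steps map onto common pandas operations. The sketch below is illustrative only: the helper `postprocess`, the chosen fill value, and the lowercasing rule are assumptions, not the project's actual choices.

```python
import pandas as pd


def postprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four finalization steps to a stacked frame."""
    # Reset the indices (returns a new frame, so the input is not modified).
    df = df.reset_index(drop=True)

    # Set data types; unparseable nutrient values become NaN.
    df["fdc_id"] = pd.to_numeric(df["fdc_id"]).astype("int64")
    df["protein"] = pd.to_numeric(df["protein"], errors="coerce")

    # Fill missing values (an empty string is an assumed default).
    df["brand_name"] = df["brand_name"].fillna("")

    # Clean string data (trimming and lowercasing are assumed rules).
    df["food_description"] = df["food_description"].str.strip().str.lower()
    return df


raw = pd.DataFrame(
    {
        "fdc_id": ["10", "11"],
        "protein": ["3.4", "n/a"],
        "brand_name": ["Acme", None],
        "food_description": ["  Whole MILK ", "Apple, raw"],
    },
    index=[5, 9],
)
clean = postprocess(raw)
```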

Cleanup and Save:
- Any remaining files other than the complete, processed data are deleted unless the --keep_files flag is specified in arguments.
- The finalized, stacked dataset is saved within the output directory as a CSV file.
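
The cleanup rule can be sketched with the standard library alone. The helper `cleanup` and the file names used here are hypothetical, and the example runs in a temporary directory so nothing real is deleted.

```python
import tempfile
from pathlib import Path


def cleanup(output_dir: Path, final_name: str, keep_files: bool) -> None:
    """Delete every file in output_dir except the final CSV, unless keep_files is set."""
    if keep_files:
        return
    # Materialize the listing first so files are not removed mid-iteration.
    for path in list(output_dir.iterdir()):
        if path.is_file() and path.name != final_name:
            path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "usda_food_nutrition_data.csv").write_text("fdc_id\n1\n")
    (out / "branded_food.parquet").write_bytes(b"")
    cleanup(out, "usda_food_nutrition_data.csv", keep_files=False)
    remaining = sorted(p.name for p in out.iterdir())
```

With `keep_files=True` the function returns immediately and both files stay in place.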

Requirements:
- Python 3.6 or later
- pandas
- requests
- pyarrow
- beautifulsoup4
- ingredient-slicer

Usage:
- Clone the repository:
  git clone https://github.com/mkayeterry/usda-fdc-data
- Navigate to the repository directory:
  cd usda-fdc-data
- Run the main script:
  python3 main.py

Optional arguments:
- --output_dir: Specify the output directory path (default: fdc_data).
- --filename: Specify the output filename (default: usda_food_nutrition_data.csv).
- --keep_files: Keep raw and individual files after processing (warning: files are large).

Example:
  python3 main.py --output_dir data --filename data.csv --keep_files
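
For reference, the three options could be declared with `argparse` as below. This is a sketch that assumes main.py uses `argparse` and treats `--keep_files` as a boolean switch; the helper `build_parser` is hypothetical, and the defaults are the ones documented above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented command-line options."""
    parser = argparse.ArgumentParser(description="Download and process USDA FDC data.")
    parser.add_argument("--output_dir", default="fdc_data", help="Output directory path.")
    parser.add_argument(
        "--filename", default="usda_food_nutrition_data.csv", help="Output filename."
    )
    parser.add_argument(
        "--keep_files", action="store_true", help="Keep raw and individual files."
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--output_dir", "data", "--filename", "data.csv", "--keep_files"]
)
defaults = build_parser().parse_args([])
```

Note that a stray space, as in `-- filename`, would not be recognized as the `--filename` option, so the flag and its name must be written together.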

The processed USDA output data is standardized and contains the following information:
- fdc_id: The unique identifier assigned to each food item within USDA FoodData Central.
- usda_data_source: Indicates the source of the food item, denoting the specific downloaded file it originated from.
- data_type: Describes the type of data associated with the food item: branded, foundation, or sr_legacy.
- category: The category or type of food.
- brand_owner: The owner or manufacturer of the brand, for branded_foods only.
- brand_name: The name of the brand, for branded_foods only.
- food_description: A description of the food item.
- food_common_name: The name the food is most commonly known by, for foundation foods only.
- food_common_category: The category the food is most commonly known to be in, for foundation foods only.
- ingredients: The ingredients used in the food item, for branded_foods only.

Portion:
- portion_amount: The amount of the food item in the portion.
- portion_unit: The unit of measurement for the portion.
- portion_modifier: Any modifier applied to the portion (such as "large" or "1/8 of crust").
- std_portion_amount: Standardized portion amount, derived from the combination of portion_amount, portion_unit, and portion_modifier (e.g. 'one' --> 1).
- std_portion_unit: Standardized portion unit, derived from the combination of portion_amount, portion_unit, and portion_modifier (e.g. 'oz' --> 'ounces').
- portion_gram_weight: The weight of the portion in grams.
- portion_energy: The energy content in calories per portion.
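
The two standardization examples ('one' --> 1 and 'oz' --> 'ounces') can be illustrated with simple lookup tables. This is only a sketch of the idea: the tables and helper names are hypothetical, and the project's actual standardization logic may work quite differently.

```python
# Hypothetical lookup tables; the project's actual rules may differ.
NUMBER_WORDS = {"one": 1.0, "two": 2.0, "half": 0.5}
UNIT_ALIASES = {"oz": "ounces", "tbsp": "tablespoons", "cup": "cups"}


def standardize_amount(raw: str) -> float:
    """Map a spelled-out or numeric amount to a float, e.g. 'one' -> 1.0."""
    text = raw.strip().lower()
    if text in NUMBER_WORDS:
        return NUMBER_WORDS[text]
    return float(text)


def standardize_unit(raw: str) -> str:
    """Map a unit abbreviation to its full plural name, e.g. 'oz' -> 'ounces'."""
    text = raw.strip().lower().rstrip(".")
    # Units without a known alias are returned unchanged (after normalization).
    return UNIT_ALIASES.get(text, text)
```

For example, `standardize_amount("one")` gives `1.0` and `standardize_unit("Oz.")` gives `"ounces"`, while an unknown unit such as `"slice"` passes through as-is.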

Energy and macronutrients:
- energy: The energy content per gram of the food item.
- protein: The protein content per gram of the food item.
- total_lipid_fat: The total lipid (fat) content per gram of the food item.
- carbohydrate_by_difference: The carbohydrate content per gram of the food item.

Minerals:
- calcium_ca: Calcium
- iron_fe: Iron
- magnesium_mg: Magnesium
- phosphorus_p: Phosphorus
- potassium_k: Potassium
- sodium_na: Sodium
- zinc_zn: Zinc
- copper_cu: Copper
- manganese_mn: Manganese
- selenium_se: Selenium

Vitamins:
- vitamin_a_rae: Vitamin A (retinol activity equivalents)
- vitamin_c_total_ascorbic_acid: Vitamin C (total ascorbic acid)
- vitamin_e_alphatocopherol: Vitamin E (alpha-tocopherol)
- vitamin_k_phylloquinone: Vitamin K (Phylloquinone)
- thiamin: Thiamin (Vitamin B1)
- riboflavin: Riboflavin (Vitamin B2)
- niacin: Niacin (Vitamin B3)
- vitamin_b6: Vitamin B6
- folate_total: Folate (total)
- vitamin_b12: Vitamin B12
- vitamin_d3_cholecalciferol: Vitamin D3 (Cholecalciferol)
- vitamin_d2_ergocalciferol: Vitamin D2 (Ergocalciferol)
- pantothenic_acid: Pantothenic Acid (Vitamin B5)
- vitamin_k_dihydrophylloquinone: Vitamin K (Dihydrophylloquinone)
- vitamin_k_menaquinone4: Vitamin K2 (Menaquinone-4)
- carotene_beta: Beta-Carotene
- retinol: Retinol (Vitamin A1)

Amino acids:
- tryptophan, threonine, methionine, phenylalanine, tyrosine, valine, arginine, histidine, isoleucine, leucine, lysine, cystine, alanine, glutamic_acid, glycine, proline, serine

Sugars:
- sucrose, glucose, maltose, fructose, lactose, galactose

Other:
- choline_total: Total Choline
- betaine: Betaine

Data for this processing project was obtained from the USDA FoodData Central (FDC) website.
This project is licensed under the MIT License - see the LICENSE file for details.