Replies: 3 comments
-
Hi @botanikus. First, thanks for such a detailed explanation of your issues. This response adds nothing new beyond our email discussion, but we will do our best to explain here so that others may benefit.
The core issue has to do with very high $skip values. We are aware of this problem and are working on a solution. As you noted, reducing the records returned with the $top parameter will have no effect. The short-term solution is to add a $filter that limits data further. If you are filtering on fields other than dates or state and results are still slow, contact OpenFEMA. The solution may be as simple as indexing fields in the OpenFEMA data store.
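As a concrete example, the R sketch below requests one state and one policy year at a time, so each filtered subset is small enough to page through with modest $skip values. The v2 FimaNfipPolicies endpoint path, the policyEffectiveDate field name, and the quoted-date filter syntax are assumptions drawn from the OpenFEMA documentation; verify them against the current API guidance before relying on them.

```r
# Sketch of a filtered request; endpoint path, the policyEffectiveDate field
# name, and the date-comparison syntax are assumptions to verify.
library(jsonlite)

base_url <- "https://www.fema.gov/api/open/v2/FimaNfipPolicies"

# One state, one policy year: a subset small enough that paging it with
# modest $skip values stays fast.
flt <- paste("propertyState eq 'FL'",
             "and policyEffectiveDate ge '2022-01-01'",
             "and policyEffectiveDate lt '2023-01-01'")

url <- paste0(base_url,
              "?$filter=", URLencode(flt, reserved = TRUE),
              "&$top=1000")

page <- fromJSON(url)
# Records are returned under a key named after the dataset.
str(page$FimaNfipPolicies)
```

Looping over narrow windows like this (states, years, or months), with a small $top inside each window, keeps every individual request fast and avoids sending a $skip in the millions.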
-
Hi @lukelamar. Thank you very much for the extensive response on the issue. I wish to add that I have found a way to easily split the large CSV and use subsets of it on low-memory devices. Sadly, I was not able to use the provided Parquet file (it has now been removed altogether). I found an alternative solution some time ago and wanted to share it here.
Working with the large NFIP Policies dataset on low-RAM, slow-internet devices
The R library bread by user @MagicHead99 was pivotal for this purpose; as the creator states, "bread functions allow to analyze a 50Gb file with a computer with 8Gb of memory." The code below reads the large Policies dataset, filters for a desired state (here Florida), and saves the result as a CSV. I later wrote a loop that generated a CSV for each state in the dataset (get the list of unique values in the propertyState column, then run the code below for each one). Finally, I converted the CSVs to FST (using the fst library's "read_fst" and "write_fst"), which significantly speeds up reading and writing. The entire process took me about an hour at most. I might add that with Python and 32 GB of RAM I was not able to handle this process before. I hope this solution helps others with low memory and slow internet, and also helps free up your server capacity. Here is my code:
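Roughly, the per-state step looks like the sketch below (paths and the state list are illustrative, and bfilter()'s argument names may differ between bread versions; check ?bfilter):

```r
# Sketch of the split-by-state workflow; verify bfilter()'s argument names
# with ?bread::bfilter for your installed version.
library(bread)
library(fst)

csv_path <- "FimaNfipPolicies.csv"   # the full multi-GB extract

# state.abb (base R) stands in for the list of unique propertyState values;
# the real dataset also contains territories, so build the list from the
# data itself if you need complete coverage.
for (st in state.abb) {
  # Pull only the rows whose propertyState matches, without loading the
  # whole CSV into memory.
  subset_df <- bfilter(file = csv_path,
                       patterns = st,
                       filtered_columns = "propertyState")

  write.csv(subset_df, paste0("policies_", st, ".csv"), row.names = FALSE)

  # An fst copy makes later reads/writes much faster than plain CSV.
  write_fst(subset_df, paste0("policies_", st, ".fst"))
}

# Later, for example: fl <- read_fst("policies_FL.fst")
```

The conversion with write_fst() is what gives the big speedup for repeated reads afterwards.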
-
Hello everyone,
I’ve been working with the FIMA NFIP Policies dataset, which the documentation says contains around 65 million rows. The CSV mentioned here is currently about 15 GB in size. However, I’m encountering significant discrepancies and challenges:
In an attempt to circumvent these issues, I’ve written a Python script to download the data directly through the FEMA API. However, this approach has its own set of challenges, given that I need to download roughly 19 million rows for Florida alone:
Here's the link to the dataset for reference: FEMA NFIP Policies Dataset.
Below is the Python code I wrote, following the online documentation, for downloading the data; it also checks that the retrieved Parquet file is not empty. Note that the CSV, JSON, and other file types resulted in the same problems:
Does anyone have suggestions on either fixing the duplicate issue in the API or, even better, on how to correctly read or repair the CSV? With the CSV I would not have to deal with API calls at all, which is possibly the fastest solution.
Thank you very much!
edit: Thank you very much for providing and maintaining this dataset!
edit: I opened the ~15 GB file using "Large Text File Viewer"; the last row is 38,155,202, as the screenshot depicts, while online (and through the API record-count call) the dataset is reported to contain 65,022,658 rows.
edit: Using the API, I have downloaded ~3 million rows so far; of these, ~500k are duplicates. This occurs for Parquet, JSON, and CSV files. Is there a way for OpenFEMA to provide the Policies file (with presumably ~65 million rows) in separate files instead of a single large CSV (which apparently only holds 38 million rows)?
edit: After reaching out to OpenFEMA, I was informed of a server-side issue that led to a corrupted CSV file and missing rows. I have since received the corrected file and am happy to report that all rows are present. Adding an $orderby to the API call does not work at larger $skip counts (over a million); the connection breaks with a 503 error. Thank you again for your quick help and response!