Replies: 3 comments
-
Hi @botanikus. First, thanks for such a detailed explanation of your issues. This response adds nothing new beyond our email discussion, but we will do our best to explain here so that others may benefit.
The core issue has to do with very high $skip values. We are aware of this problem and are working on a solution. As you noted, reducing the records returned with the $top parameter will have no effect. The short-term solution is to add a $filter that limits data further. If you are filtering on fields other than dates or state and results are still slow, contact OpenFEMA. The solution may be as simple as indexing fields in the OpenFEMA data store.
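As a concrete example, the R sketch below requests one state and one policy year at a time, so each filtered subset is small enough to page through with modest $skip values. The v2 FimaNfipPolicies endpoint path, the policyEffectiveDate field name, and the quoted-date filter syntax are assumptions drawn from the OpenFEMA documentation; verify them against the current API guidance before relying on them.

```r
# Sketch of a filtered request; endpoint path, the policyEffectiveDate field
# name, and the date-comparison syntax are assumptions to verify.
library(jsonlite)

base_url <- "https://www.fema.gov/api/open/v2/FimaNfipPolicies"

# One state, one policy year: a subset small enough that paging it with
# modest $skip values stays fast.
flt <- paste("propertyState eq 'FL'",
             "and policyEffectiveDate ge '2022-01-01'",
             "and policyEffectiveDate lt '2023-01-01'")

url <- paste0(base_url,
              "?$filter=", URLencode(flt, reserved = TRUE),
              "&$top=1000")

page <- fromJSON(url)
# Records are returned under a key named after the dataset.
str(page$FimaNfipPolicies)
```

Looping over narrow windows like this (states, years, or months), with a small $top inside each window, keeps every individual request fast and avoids sending a $skip in the millions.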
-
Hi @lukelamar. Thank you very much for the extensive response on the issue. I wish to add that I have found a way to easily split the large CSV and use subsets of it on low-memory devices. Sadly, I was not able to use the provided Parquet file (it has now been removed altogether). I found an alternative solution some time ago and wanted to share it here.
Working with the large NFIP Policies dataset on low-RAM, slow-internet devices
The R library bread by user @MagicHead99 was pivotal for this purpose; as the creator states, "bread functions allow to analyze a 50Gb file with a computer with 8Gb of memory." The code below reads the large Policies dataset, filters for a desired state (here Florida), and saves the result as a CSV. I later wrote a loop that generated a CSV for each state in the dataset (get the list of unique values in the propertyState column, then run the code below for each one). Finally, I converted the CSVs to FST (using the fst library's "read_fst" and "write_fst"), which significantly speeds up reading and writing. The entire process took me about an hour at most. I might add that with Python and 32 GB of RAM I was not able to handle this process before. I hope this solution helps others with low memory and slow internet, and also helps free up your server capacity. Here is my code:
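Roughly, the per-state step looks like the sketch below (paths and the state list are illustrative, and bfilter()'s argument names may differ between bread versions; check ?bfilter):

```r
# Sketch of the split-by-state workflow; verify bfilter()'s argument names
# with ?bread::bfilter for your installed version.
library(bread)
library(fst)

csv_path <- "FimaNfipPolicies.csv"   # the full multi-GB extract

# state.abb (base R) stands in for the list of unique propertyState values;
# the real dataset also contains territories, so build the list from the
# data itself if you need complete coverage.
for (st in state.abb) {
  # Pull only the rows whose propertyState matches, without loading the
  # whole CSV into memory.
  subset_df <- bfilter(file = csv_path,
                       patterns = st,
                       filtered_columns = "propertyState")

  write.csv(subset_df, paste0("policies_", st, ".csv"), row.names = FALSE)

  # An fst copy makes later reads/writes much faster than plain CSV.
  write_fst(subset_df, paste0("policies_", st, ".fst"))
}

# Later, for example: fl <- read_fst("policies_FL.fst")
```

The conversion with write_fst() is what gives the big speedup for repeated reads afterwards.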
-
Hello everyone,
I’ve been working with the FIMA NFIP Policies dataset, which the documentation says contains around 65 million rows. The CSV mentioned here is currently about 15 GB in size. However, I’m encountering significant discrepancies and challenges:
In an attempt to circumvent these issues, I’ve written a Python script to download the data directly through the FEMA API. However, this approach has its own set of challenges, given that I need to download roughly 19 million rows for Florida alone:
Here's the link to the dataset for reference: FEMA NFIP Policies Dataset.
Below is the Python code I wrote, following the online documentation, for downloading the data; it also checks that the retrieved Parquet file is not empty. Note that the CSV, JSON, and other file types resulted in the same problems:
Does anyone have suggestions on either fixing the duplicate issue in the API or, even better, on how to correctly read or repair the CSV? With the CSV I would not have to deal with API calls at all, which is possibly the fastest solution.
Thank you very much!
edit: Thank you very much for providing and maintaining this dataset!
edit: I opened the ~15 GB file using "Large Text File Viewer"; the last row is 38,155,202, as the screenshot depicts, while online (and through the API record-count call) the dataset is reported to contain 65,022,658 rows.
edit: Using the API, I have downloaded ~3 million rows so far; of these, ~500k are duplicates. This occurs for Parquet, JSON, and CSV files. Is there a way for OpenFEMA to provide the Policies file (with presumably ~65 million rows) in separate files instead of a single large CSV (which apparently only holds 38 million rows)?
edit: After reaching out to OpenFEMA, I was informed of a server-side issue that led to a corrupted CSV file and missing rows. I have since received the corrected file and am happy to report that all rows are present. Adding an $orderby to the API call does not work at larger $skip counts (over a million); the connection breaks with a 503 error. Thank you again for your quick help and response!