This project demonstrates various processing approaches for handling large Xml files, using the publicly available LEI2 dataset.
See Data for the scripts that download the data from the public API endpoint and upload the files to the Data Lake Store.
See ConsoleApp for the .NET Core CommandLineApp, a starting point for deriving smaller files from the big Xml, with many useful .NET methods for reading, writing, validating, and serializing Xml.
See FunctionApp for the Functions that host the .NET Core snippets and are (mostly) triggered by new Blob events.
See Database for the T-SQL approaches
See Databricks for the Spark approaches
See DataLake for the U-SQL approaches (outdated; use FunctionApp and Databricks instead).
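To illustrate the idea of deriving smaller files from the big Xml without loading it into memory, here is a minimal Python sketch. It is not the ConsoleApp's .NET code. It assumes the records are elements whose local name is LEIRecord (an assumption about the file layout; adjust record_tag if it differs). The function names and the Subset wrapper element are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def iter_records(source, record_tag="LEIRecord"):
    """Yield record elements one at a time from a path or binary file object.

    record_tag is an assumed element name; adjust it to the actual schema.
    """
    ancestors = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            ancestors.append(elem)
            continue
        ancestors.pop()  # elem itself is on top of the stack at its end event
        if local_name(elem.tag) == record_tag:
            yield elem
            if ancestors:
                # Detach the processed record so memory use stays flat.
                ancestors[-1].remove(elem)


def write_subset(source, out, limit, record_tag="LEIRecord"):
    """Write the first `limit` records to the binary file object `out`.

    The <Subset> wrapper is a hypothetical root element for the smaller file.
    Returns the number of records written.
    """
    written = 0
    out.write(b"<Subset>\n")
    for elem in itertools.islice(iter_records(source, record_tag), limit):
        out.write(ET.tostring(elem))
        out.write(b"\n")
        written += 1
    out.write(b"</Subset>\n")
    return written
```

Calling write_subset with the path of the extracted Xml, an output file opened in binary mode, and a limit such as 1000 would produce a small sample file that is easier to inspect than the full dataset.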
The Global Legal Entity Identifier Foundation (GLEIF) is tasked with supporting the implementation and use of the Legal Entity Identifier (LEI). The LEI enables clear and unique identification of legal entities engaging in financial transactions.
LEI data is a good open data source for demonstrating multi-GB Xml handling while working with a valuable dataset, because we believe in working software over comprehensive documentation.
Run Download-LEI2.ps1 to download the 155 MB Zip file and extract the 2.6 GB Xml file.
About the LEI data format: LEI Level 1 data CDF v2.1
About downloading the concatenated files: gleif.org/gleif-concatenated-file
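For readers without PowerShell, the extraction step that Download-LEI2.ps1 performs could be approximated as in the Python sketch below. It is not a port of the script: the function name extract_xml is hypothetical, the download itself is left out, and the sketch assumes the Zip file is already on disk and contains one or more .xml members.

```python
import shutil
import zipfile
from pathlib import Path


def extract_xml(zip_source, dest_dir):
    """Extract every .xml member of the archive into dest_dir.

    zip_source may be a path or a seekable binary file object.
    Members are copied in chunks, so a multi-GB file never sits in memory.
    Returns the list of extracted file paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            # Keep only the file name so members cannot escape dest_dir.
            target = dest / Path(info.filename).name
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, length=1024 * 1024)
            extracted.append(target)
    return extracted
```

Copying in fixed-size chunks matters here because the extracted Xml is far larger than the archive, so reading a whole member at once would need several GB of memory.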