Data sources include:

- `egrul` folder with `.csv` files listing all Russian companies, loaded for each region from Nalog.ru;
- `rosstat` folder with `.xlsx` files aggregating regional statistics from Rosstat.gov.ru (version 2022);
- `msp` folder with `.xlsx` files from Nalog.ru, covering small and medium-sized companies only. HERE edit the version date used (see this parameter on the mentioned page);
- `msp_xml` folder with `.xml` files from Nalog.ru. The data are very similar to the previous source, but with a different representation and a few extra features;
- `features.xlsx`, a file created by hand that lists all regional features used in the analysis.

Further processing saves intermediate files to the `data` folder.
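As an illustration, the per-region EGRUL `.csv` files could be combined into a single table roughly as follows. This is a hypothetical sketch: the folder layout, the `region` column, and the schema are assumptions, not the repo's actual code.

```python
from pathlib import Path

import pandas as pd

# Hypothetical sketch: concatenate per-region EGRUL CSV files into one frame.
# The folder name and columns are assumptions about the actual data layout.
def load_egrul(folder: str = "egrul") -> pd.DataFrame:
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(path)
        df["region"] = path.stem  # keep the source region as a feature
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```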
The `config` folder stores configurations for data and model parameters. These parameters were obtained with Optuna optimization (the optimization code is not included).
From the repo folder run:

```shell
docker build -t stat .
docker run -it -v <CODE FOLDER>:/workdir -v <DATA FOLDER>:/workdir/data -m 16000m --cpus=4 -w /workdir stat
```
There are two options:

- Inside the container run `.sh` (not implemented yet) to download raw company data (`.xls`, `.xlsx`, `.csv` file formats) from Google Drive, unpack it, and delete the archived data.
- Inside the container run `sh download.sh` to download preprocessed company data (`.parquet` file format) from Google Drive, unpack it, and delete the archived data.
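The download-unpack-cleanup sequence described above could be sketched in Python like this. The function and its URL are placeholders, not the actual contents of `download.sh`:

```python
import os
import urllib.request
import zipfile

# Hypothetical Python equivalent of the steps described for download.sh:
# download an archive, unpack it, then delete the archived data.
# The URL, archive name, and destination folder are placeholders.
def fetch_data(url: str, archive: str = "data.zip", dest: str = "data") -> None:
    urllib.request.urlretrieve(url, archive)  # download the archive
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)                   # unpack into the data folder
    os.remove(archive)                        # delete the archived data
```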
- `preprocess.py` handles raw data, so it runs after loading data with option 1. The output file `data/parquet/companies_feat.parquet` contains all companies mentioned in the MSP registry and closed to date (i.e. companies with a finite 'lifetime' feature, which serves as the target variable). This step requires the `data_raw` folder; however, you may skip it and use the `.parquet` files loaded to the `data` folder with option 2.
- `train.py` performs the regression analysis with several algorithms and writes pretrained models and their metrics to the `data/models/metrics.parquet` file.
- `run.py` (not implemented yet) predicts the lifetime for a company with parameters listed in `config/predict.yaml`.
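Once `train.py` has written the metrics table, it can be inspected with pandas. The helper below is a hypothetical sketch: the column names `model` and `rmse` are assumptions about the schema of `data/models/metrics.parquet`, not its documented layout.

```python
import pandas as pd

# Hypothetical helper: pick the best algorithm from the metrics table
# written by train.py. The 'model' and 'rmse' columns are assumed names.
def best_model(metrics: pd.DataFrame, metric: str = "rmse") -> str:
    return metrics.loc[metrics[metric].idxmin(), "model"]

# Usage sketch:
# metrics = pd.read_parquet("data/models/metrics.parquet")
# print(best_model(metrics))
```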