Data sources include:

- `egrul` folder with `.csv` files listing all Russian companies, loaded for each region from Nalog.ru;
- `rosstat` folder with `.xlsx` files aggregating regional statistics from Rosstat.gov.ru (version 2022);
- `msp` folder with `.xlsx` files from Nalog.ru, covering small and medium-sized companies only. HERE edit the version date used (see this parameter on the mentioned page);
- `msp_xml` folder with `.xml` files from Nalog.ru. The data are very similar to the previous source, but with a different representation and a few extra features;
- `features.xlsx`, a file created by hand that lists all regional features used in the analysis.

Further processing saves intermediate files to the `data` folder.
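As an illustration, the per-region EGRUL `.csv` files could be combined into a single table roughly as follows. This is a hypothetical sketch: the folder layout, the `region` column, and the schema are assumptions, not the repo's actual code.

```python
from pathlib import Path

import pandas as pd

# Hypothetical sketch: concatenate per-region EGRUL CSV files into one frame.
# The folder name and columns are assumptions about the actual data layout.
def load_egrul(folder: str = "egrul") -> pd.DataFrame:
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(path)
        df["region"] = path.stem  # keep the source region as a feature
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```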
The `config` folder stores configurations for data and model parameters. These parameters were obtained with Optuna optimization (the optimization code is not included).
From the repo folder run:

```shell
docker build -t stat .
docker run -it -v <CODE FOLDER>:/workdir -v <DATA FOLDER>:/workdir/data -m 16000m --cpus=4 -w /workdir stat
```
There are two options:

- Inside the container run `.sh` (not implemented yet) to download raw company data (`.xls`, `.xlsx`, `.csv` file formats) from Google Drive, unpack it, and delete the archived data.
- Inside the container run `sh download.sh` to download preprocessed company data (`.parquet` file format) from Google Drive, unpack it, and delete the archived data.
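The download-unpack-cleanup sequence described above could be sketched in Python like this. The function and its URL are placeholders, not the actual contents of `download.sh`:

```python
import os
import urllib.request
import zipfile

# Hypothetical Python equivalent of the steps described for download.sh:
# download an archive, unpack it, then delete the archived data.
# The URL, archive name, and destination folder are placeholders.
def fetch_data(url: str, archive: str = "data.zip", dest: str = "data") -> None:
    urllib.request.urlretrieve(url, archive)  # download the archive
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)                   # unpack into the data folder
    os.remove(archive)                        # delete the archived data
```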
- `preprocess.py` handles raw data, so it runs after loading data with option 1. The output file `data/parquet/companies_feat.parquet` contains all companies mentioned in the MSP registry and closed to date (i.e. companies with a finite 'lifetime' feature, which serves as the target variable). This step requires the `data_raw` folder; however, you may skip it and use the `.parquet` files loaded to the `data` folder with option 2.
- `train.py` performs the regression analysis with several algorithms and writes pretrained models and their metrics to the `data/models/metrics.parquet` file.
- `run.py` (not implemented yet) predicts the lifetime for a company with parameters listed in `config/predict.yaml`.
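Once `train.py` has written the metrics table, it can be inspected with pandas. The helper below is a hypothetical sketch: the column names `model` and `rmse` are assumptions about the schema of `data/models/metrics.parquet`, not its documented layout.

```python
import pandas as pd

# Hypothetical helper: pick the best algorithm from the metrics table
# written by train.py. The 'model' and 'rmse' columns are assumed names.
def best_model(metrics: pd.DataFrame, metric: str = "rmse") -> str:
    return metrics.loc[metrics[metric].idxmin(), "model"]

# Usage sketch:
# metrics = pd.read_parquet("data/models/metrics.parquet")
# print(best_model(metrics))
```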