Publication home: http://www.gks.ru/wps/wcm/connect/rosstat_main/rosstat/ru/statistics/publications/catalog/doc_1140087276688
- download files from web based on (year, month)
- extract cells from table as list of lists (row elements)
- map selected cell values to dictionaries with keys 'name', 'freq', 'date' and 'value'
- test dictionaries against reference values
- First three data columns contain information of interest:
  - "Август 2017г." (August 2017): this is the publication month, August; the column holds absolute values
  - "В % к августу 2016 г." (in % of August 2016): this is the 'yoy' (year on year) rate of growth
  - "В % к июлю 2017 г." (in % of July 2017): this is the 'rog' (rate of growth) change on the previous month
- We disregard the rest of data columns
Several rows contain data for periods other than August:
- Previous period from year start:
  - Валовой внутренний продукт, млрд.рублей (gross domestic product, bln rub)
  - Инвестиции в основной капитал, млрд.рублей (fixed capital investment, bln rub)
- One month behind:
  - Внешнеторговый оборот, млрд.долларов США (foreign trade turnover, bln USD) and the 2 subsequent lines
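For the rows that run one month behind, the date assigned to the values has to be shifted back from the publication month. A minimal sketch of such a shift is below; `shift_month` is a hypothetical helper name, not part of the existing parser, and it only covers the monthly lag (the "from year start" rows need separate treatment that is not sketched here).

```python
def shift_month(date, months=-1):
    """Shift a 'YYYY-MM' string by a number of months (default: one month back)."""
    year, month = map(int, date.split('-'))
    # Count months from year 0 so that year boundaries are handled by arithmetic.
    index = year * 12 + (month - 1) + months
    return f'{index // 12:04d}-{index % 12 + 1:02d}'


# The August 2017 publication reports foreign trade turnover for July 2017.
print(shift_month('2017-08'))  # 2017-07
```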
Table in Section 1 ("ОСНОВНЫЕ ЭКОНОМИЧЕСКИЕ И СОЦИАЛЬНЫЕ ПОКАЗАТЕЛИ", "Main economic and social indicators"), preserving the table cell structure:
Header | Август 2017г. | В % к августу 2016 г. | В % к июлю 2017 г. |
---|---|---|---|
Валовой внутренний продукт, млрд.рублей | 41782,1 1) | 101,5 2) | |
Индекс промышленного производства 4) | | 101,5 | 102,0 |
Продукция сельского хозяйства, млрд.рублей | 712,6 | 104,7 | 149,1 |
Line 1: we still need this data, but it is for a period other than August, carries footnote marks and is not at monthly frequency, so it must be treated as a special case.
Lines 2 and 3 should read as:
[
dict(name='INDPRO_yoy', freq='m', date='2017-08', value=101.5),
dict(name='INDPRO_rog', freq='m', date='2017-08', value=102.0),
dict(name='AGROPROD_bln_rub', freq='m', date='2017-08', value=712.6),
dict(name='AGROPROD_yoy', freq='m', date='2017-08', value=104.7),
dict(name='AGROPROD_rog', freq='m', date='2017-08', value=149.1)
]
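The mapping from lines 2 and 3 to the dictionaries above can be sketched as follows. This is an illustration under assumptions, not the parser's actual code: `ROW_SPECS`, `to_float` and `row_to_dicts` are hypothetical names, and the sketch assumes that each known row lists its suffixes in the order of its non-empty cells.

```python
# Hypothetical row specs: label prefix -> (variable name, suffixes for the
# non-empty data cells, left to right). These names are assumptions.
ROW_SPECS = {
    'Индекс промышленного производства': ('INDPRO', ['yoy', 'rog']),
    'Продукция сельского хозяйства': ('AGROPROD', ['bln_rub', 'yoy', 'rog']),
}


def to_float(text):
    """Convert a number with a decimal comma, such as '712,6', to a float."""
    return float(text.replace(',', '.').replace(' ', ''))


def row_to_dicts(row, date, freq='m'):
    """Map one table row (label followed by data cells) to dictionaries."""
    label, cells = row[0], row[1:4]  # only the first three data columns
    for prefix, (varname, suffixes) in ROW_SPECS.items():
        if label.startswith(prefix):
            # Empty cells are dropped, so the position of a blank does not matter.
            values = [c for c in cells if c.strip()]
            return [dict(name=f'{varname}_{suffix}', freq=freq,
                         date=date, value=to_float(v))
                    for suffix, v in zip(suffixes, values)]
    return []  # unknown or special-case rows (such as line 1) are skipped


rows = [
    ['Индекс промышленного производства 4)', '', '101,5', '102,0'],
    ['Продукция сельского хозяйства, млрд.рублей', '712,6', '104,7', '149,1'],
]
result = [d for row in rows for d in row_to_dicts(row, date='2017-08')]
```

Matching on a label prefix keeps footnote marks such as "4)" out of the lookup. Footnote marks attached to values, as in line 1, are not handled here.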
Best case: the solution works on Windows and Linux.
Second best: there are two solutions, one for Windows and another for Linux.
Current implementation: Windows + Word
pip install pdfminer.six
python D:\Continuum\Anaconda3\Scripts\pdf2txt.py -p 4 oper.pdf -t xml > oper.xml
- parse oper.xml next
- on Windows may use https://github.com/mini-kep/parser-rosstat-kep/tree/master/src/word2csv
- tech stack for parsing Word at https://gist.github.com/epogrebnyak/252e5b568d58b7e9c635c2723d81c850
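As a possible first step for parsing oper.xml, the sketch below collects the text of each line from the pdfminer XML output using the standard library. It assumes the file is well-formed XML with per-character `<text>` elements nested inside `<textline>` elements, which is how `pdf2txt.py -t xml` lays out its output; `text_lines` is a hypothetical helper name, and grouping the lines back into table rows is not covered.

```python
import xml.etree.ElementTree as ET


def text_lines(root):
    """Return the text of every <textline> element as a list of strings."""
    lines = []
    for textline in root.iter('textline'):
        # Each <text> element holds a single character (or whitespace).
        chars = [t.text or '' for t in textline.iter('text')]
        line = ''.join(chars).strip()
        if line:
            lines.append(line)
    return lines


# A tiny hand-written sample in the same shape as the pdfminer output.
sample = """<pages><page id="4"><textbox id="0">
<textline><text>1</text><text>0</text><text>1</text><text>,</text><text>5</text><text>
</text></textline>
<textline><text>1</text><text>0</text><text>2</text><text>,</text><text>0</text><text>
</text></textline>
</textbox></page></pages>"""

print(text_lines(ET.fromstring(sample)))  # ['101,5', '102,0']
```

For the real file, `ET.parse('oper.xml').getroot()` would supply the root element in place of the sample string.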