python hw_main.py
Note: Before running all homeworks, clear the converted_data and target_data folders, and delete the tmp_data folder if it exists.
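The cleanup described in the note can also be scripted. The following is only a minimal sketch: the helper name reset_workdirs is hypothetical (it is not part of the repository), and it assumes the three folders live in the directory you run the homeworks from. It also recreates converted_data and target_data if they are missing, which is an assumption on my side.

```python
import shutil
from pathlib import Path


def reset_workdirs(root="."):
    """Empty converted_data and target_data, and remove tmp_data if present.

    Hypothetical helper; the folder names come from the note above.
    """
    root = Path(root)
    for name in ("converted_data", "target_data"):
        folder = root / name
        # Remove the folder with its contents, then recreate it empty.
        shutil.rmtree(folder, ignore_errors=True)
        folder.mkdir(parents=True, exist_ok=True)
    # tmp_data is deleted entirely; ignore_errors covers the case
    # where the folder does not exist.
    shutil.rmtree(root / "tmp_data", ignore_errors=True)
```

Calling reset_workdirs() from the repository root before python hw_main.py should leave the folders in the state the note asks for.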
We have 2 datasets:
- HTML files with smartphone characteristics from the DNS shop
- HTML files with smartphone characteristics from the Citilink shop
All raw HTML files are in the raw_data folder.
Here the raw data is converted to the target schema (html -> json). I use 2 MRJob steps (hw01_html_to_json.py):
- Clean the raw data (delete useless info, extract useful parameters) using the preparsing in hw01_clear_html.py, and save the result in converted_data
- Build JSON files in a single schema (so both sources use the same attribute names) using my small regular expression in hw01_my_regexp.py, and save the result in target_data
python hw01_main_convert.py
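To illustrate what "a single schema" means in the conversion step, here is a small sketch in plain Python (no MRJob). It is not the code from hw01_my_regexp.py: the patterns, the attribute labels, the target names, and the two sample records are all made up for the example.

```python
import json
import re

# Hypothetical patterns: each maps a source-specific attribute label
# to one shared target name.
ATTRIBUTE_PATTERNS = [
    (re.compile(r"^(model|model name)$", re.IGNORECASE), "model"),
    (re.compile(r"^(screen size|display diagonal)", re.IGNORECASE), "screen_size"),
    (re.compile(r"^(ram|memory \(ram\))", re.IGNORECASE), "ram"),
]


def to_target_schema(raw_record):
    """Rename known attributes to the shared names; drop labels that match nothing."""
    result = {}
    for label, value in raw_record.items():
        for pattern, target_name in ATTRIBUTE_PATTERNS:
            if pattern.search(label.strip()):
                result[target_name] = value
                break
    return result


# Made-up records standing in for the cleaned output of each shop.
dns_record = {"Model": "X1", "Screen size, in": "6.1", "Warranty": "12 months"}
citilink_record = {"Model name": "X1", "Display diagonal": "6.1"}

print(json.dumps(to_target_schema(dns_record), sort_keys=True))
print(json.dumps(to_target_schema(citilink_record), sort_keys=True))
```

After the mapping both records use the same keys, so the later linkage and fusion steps can compare them directly.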
Entity resolution: record linkage
Here, data from different sources is linked (via MapReduce) using a unique identifier. The input data is located in the target_data folder. The output data is located in the er_data folder.
python hw02_main.py
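The linkage idea can be sketched without a MapReduce framework. In this sketch the map step emits the identifier as the key, and the reduce step collects all records that share it. The field name model used as the unique identifier, the source labels, and the function names are assumptions for illustration; the real job in hw02_main.py may differ.

```python
from collections import defaultdict


def map_record(source, record, key_field="model"):
    """Map step: emit (normalized identifier, (source, record))."""
    yield record[key_field].strip().lower(), (source, record)


def reduce_records(key, values):
    """Reduce step: gather the records that share one identifier.

    Assumes at most one record per source for a given identifier.
    """
    yield key, {source: record for source, record in values}


def link(records_by_source, key_field="model"):
    groups = defaultdict(list)  # stands in for the shuffle/sort phase
    for source, records in records_by_source.items():
        for record in records:
            for key, value in map_record(source, record, key_field):
                groups[key].append(value)
    linked = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_records(key, values):
            linked[out_key] = out_value
    return linked


# Made-up input: one smartphone present in both shops, one only in Citilink.
linked = link({
    "dns": [{"model": "X1", "ram": "4"}],
    "citilink": [{"model": "x1 ", "ram": "4 GB"}, {"model": "Y2"}],
})
```

Normalizing the key (strip and lowercase) is a design choice of the sketch, so that "X1" and "x1 " end up in the same group.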
Data fusion
The input data is located in the er_data folder. Here, data linked by a common attribute is fused into one file, fusion_data/fusion.json, via the fuse_by function. Records that have no link to the other source are also added to the result file.
python hw03_main.py
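A possible fusion rule is sketched below. It is not the actual fuse_by function, whose signature is not shown here: the names fuse_group and fuse_all, the source priority, and the sample input are assumptions. The rule keeps, for each attribute, the first non-empty value in source priority order.

```python
import json


def fuse_group(records_by_source, priority=("dns", "citilink")):
    """Merge one linked group: per attribute, keep the first non-empty
    value in source priority order. Sources not in `priority` are ignored."""
    fused = {}
    for source in priority:
        for name, value in records_by_source.get(source, {}).items():
            if name not in fused and value not in (None, ""):
                fused[name] = value
    return fused


def fuse_all(linked):
    # A group with a single source is passed through as well, so records
    # without a match in the other shop still reach the result.
    return [fuse_group(group) for _, group in sorted(linked.items())]


# Made-up linked data: "x1" is present in both shops, "y2" only in one.
linked_groups = {
    "x1": {"dns": {"model": "X1", "ram": ""},
           "citilink": {"model": "X1", "ram": "4 GB"}},
    "y2": {"citilink": {"model": "Y2"}},
}
fused = fuse_all(linked_groups)
print(json.dumps(fused, indent=2))
```

Here the empty ram value from the first source is filled from the second one, and the unlinked record is kept as is.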