GitHub - PetrovIgorA/BigData_hw: Homeworks for course "Information Extraction and Integration from Big Data"

HW, course "Information Extraction and Integration from Big Data"

Run all homeworks

python hw_main.py

Note: Before running all homeworks, clear the converted_data and target_data folders, and delete the tmp_data folder if it exists.
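
The cleanup in the note above can be scripted. The following is only a sketch: the helper name reset_workdirs is hypothetical and is not part of this repository, and it assumes that "clear" means removing everything inside the two folders while keeping the folders themselves.

```python
import shutil
from pathlib import Path


def reset_workdirs(root="."):
    """Empty converted_data and target_data, and remove tmp_data if present."""
    root = Path(root)
    for name in ("converted_data", "target_data"):
        folder = root / name
        if not folder.is_dir():
            continue  # nothing to clear
        for entry in folder.iterdir():
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    # tmp_data is deleted entirely, not just emptied
    shutil.rmtree(root / "tmp_data", ignore_errors=True)
```

Calling reset_workdirs() from the repository root before python hw_main.py would then leave the folders in the state the note asks for.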

HW 0

We have 2 datasets:

HTML files with smartphone characteristics from the DNS shop
HTML files with smartphone characteristics from the Citilink shop

All raw HTML files are in raw_data.

HW 1

Here the raw data is converted to the target schema (HTML -> JSON). I use 2 MRjob steps (hw01_html_to_json.py):

Clean the raw data (delete useless info, extract useful parameters) using the preparsing in hw01_clear_html.py; the result is saved in converted_data
Build JSON files in a single schema (so both sources share the same attribute names) using the small regular expression code in hw01_my_regexp.py; the result is saved in target_data
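
The idea behind the two steps can be illustrated with a standalone sketch. It deliberately does not use MRjob so that it runs on its own, and everything in it is an assumption made for illustration: the HTML layout, the label names, the target attribute names, and the patterns are invented and do not come from hw01_clear_html.py or hw01_my_regexp.py.

```python
import json
import re

# Hypothetical mapping from source-specific labels to target attribute names.
TARGET_PATTERNS = {
    "screen_size": re.compile(r"(screen|display)\s+(size|diagonal)", re.I),
    "ram": re.compile(r"\bRAM\b|random\s+access\s+memory", re.I),
}

# Assumed layout: one characteristic per two-cell table row.
ROW_RE = re.compile(r"<tr>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def clean_html(html):
    """Step 1: keep only (label, value) pairs from table rows, without markup."""
    pairs = {}
    for label, value in ROW_RE.findall(html):
        label = TAG_RE.sub("", label).strip()
        value = TAG_RE.sub("", value).strip()
        if label and value:
            pairs[label] = value
    return pairs


def to_target_schema(pairs):
    """Step 2: rename source labels to the shared attribute names."""
    record = {}
    for label, value in pairs.items():
        for target_name, pattern in TARGET_PATTERNS.items():
            if pattern.search(label):
                record[target_name] = value
                break  # labels that match no pattern are dropped
    return record


html = (
    '<table>'
    '<tr><td>Display diagonal</td><td><b>6.1"</b></td></tr>'
    '<tr><td>RAM</td><td>8 GB</td></tr>'
    '<tr><td>Warranty</td><td>12 months</td></tr>'
    '</table>'
)
record = to_target_schema(clean_html(html))
print(json.dumps(record, ensure_ascii=False))
```

Keeping the cleaning and the renaming separate mirrors the split between converted_data and target_data: the first output is still source-specific, the second is comparable across shops.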

Run

python hw01_main_convert.py

HW 2

Entity resolution: record linkage

Here, data from different sources is linked (via MapReduce) using a unique identifier. The input data is located in the target_data folder. The output data is located in the er_data folder.
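
Linking by a unique identifier maps naturally onto MapReduce: the mapper emits the identifier as the key, and the reducer receives all records that share it. The sketch below simulates this in memory rather than with MRjob. The field name "id", the source labels, and the output layout are assumptions for illustration, not the format used in hw02_record_linkage.py.

```python
from collections import defaultdict


def mapper(source, record):
    """Emit (identifier, (source, record)) so equal identifiers meet in one reducer."""
    yield record["id"], (source, record)


def reducer(identifier, tagged_records):
    """Collect every source's record for one identifier into a linked entry."""
    linked = {"id": identifier, "sources": {}}
    for source, record in tagged_records:
        linked["sources"][source] = record
    yield linked


def run_linkage(datasets):
    """Simulate map -> shuffle -> reduce in memory."""
    groups = defaultdict(list)
    for source, records in datasets.items():
        for record in records:
            for key, value in mapper(source, record):
                groups[key].append(value)  # shuffle: group values by key
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


datasets = {
    "dns": [{"id": "A1", "ram": "8 GB"}, {"id": "B2", "ram": "4 GB"}],
    "citilink": [{"id": "A1", "ram": "8GB"}],
}
linked = run_linkage(datasets)
```

In this example the entry for "A1" carries records from both shops, while "B2" carries only one, which is the case the fusion step has to pass through unchanged.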

Run

python hw02_main.py

HW 3

Data fusion

The input data is located in the er_data folder. Here, data linked by a common attribute is fused into a single file, fusion_data/fusion.json, via the fuse_by function. Records that have no link to the other source are also added to the result file.
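
A minimal sketch of what fusing linked records can look like follows. It is not the repository's fuse_by function, whose signature and conflict rules are not shown here. The function names, the record layout, and the "first non-empty value wins" rule are all assumptions chosen for illustration.

```python
import json


def resolve_first_non_empty(values):
    """Hypothetical conflict rule: take the first value that is not empty."""
    for value in values:
        if value not in (None, ""):
            return value
    return None


def fuse_group(records, resolve=resolve_first_non_empty):
    """Merge records describing one entity into a single record."""
    attributes = []
    for record in records:
        for name in record:
            if name not in attributes:
                attributes.append(name)  # keep first-seen attribute order
    return {name: resolve([r.get(name) for r in records]) for name in attributes}


def fuse_all(linked_groups):
    """Fuse every group; a group with one record (no link) passes through."""
    return [fuse_group(group) for group in linked_groups]


linked_groups = [
    [{"id": "A1", "ram": "", "color": "black"}, {"id": "A1", "ram": "8 GB"}],
    [{"id": "B2", "ram": "4 GB"}],
]
fused = fuse_all(linked_groups)
print(json.dumps(fused, indent=2))
```

Passing the resolution rule as a parameter keeps the merge loop independent of how conflicts are settled, so a different rule (for example, preferring one shop) could be swapped in without touching fuse_group.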

Run

python hw03_main.py
