feat: pipeline to aggregate data from all years #100

Draft
wants to merge 22 commits into
base: master
Commits
d1bddff
feat: start pipeline to aggregate data
arnaldog12 Sep 5, 2022
09724e9
⚒️ chore: add profile command to makefile
arnaldog12 Sep 7, 2022
5472f40
🎲feat: base aggregate pipeline
arnaldog12 Sep 7, 2022
09c0f52
📆 feat: add year columns to preprocessed datasets
arnaldog12 Sep 9, 2022
3048587
🎲 feat: more clube ids
arnaldog12 Sep 19, 2022
398b803
chore: add Paraná to parameters.yml
arnaldog12 Oct 10, 2022
b12dbb6
Merge branch 'master' into feat/aggregate-pipeline
arnaldog12 Oct 21, 2022
4757009
chore: dependencies and remove unused tests
arnaldog12 Jun 15, 2024
2d3ff16
chore: remove black, isort and flake8
arnaldog12 Jun 15, 2024
6139ea2
chore: replace pandas_profiling by ydata_profiling
arnaldog12 Jun 15, 2024
db85c9c
fix: minor in parameters and data
arnaldog12 Jun 15, 2024
9dcaa56
feat: fix accumulated scouts in preprocessing
arnaldog12 Jun 15, 2024
c4e1961
feat: start schema validation with pandera
arnaldog12 Jun 15, 2024
e5301e7
test: start data tests in aggregated data
arnaldog12 Jun 15, 2024
098f384
feat: rename scouts columns
arnaldog12 Jun 16, 2024
fb1def8
feat: drop columns and convert types
arnaldog12 Jun 16, 2024
7efd2f4
feat: drop columns and improve schemas
arnaldog12 Jun 16, 2024
271ae38
feat: normalize player position
arnaldog12 Jun 16, 2024
94eebbb
test: convert_types node of aggregate pipeline
arnaldog12 Jun 16, 2024
30b4a38
test: improve test case
arnaldog12 Jun 16, 2024
d202e60
chore: improve coverage and schemas
arnaldog12 Jun 16, 2024
25bf634
tests: preprocessing nodes
arnaldog12 Jun 16, 2024
tests: preprocessing nodes
arnaldog12 committed Jun 16, 2024
commit 25bf634d79609ab8335998b1e506b96ab6b33c7f
2 changes: 1 addition & 1 deletion src/cartola/pipelines/preprocessing/nodes.py
@@ -55,4 +55,4 @@ def fix_accumulated_scouts(df: pd.DataFrame, dict_scouts: Dict[str, float]) -> p
         df_round = get_disaccumulated_scouts_for_round(df, round_, cols_scouts)
         df_result = pd.concat([df_result, df_round], ignore_index=True)
 
-    return df
+    return df_result
82 changes: 82 additions & 0 deletions src/tests/nodes/test_preprocessing_nodes.py
@@ -0,0 +1,82 @@
import numpy as np
import pandas as pd

from cartola.pipelines.preprocessing.nodes import (
    add_year_column,
    fill_empty_slugs,
    fill_scouts_with_zeros,
    fix_accumulated_scouts,
    map_posicao_to_string,
    map_status_id_to_string,
)


def test_fill_scouts_with_zeros():
    df = pd.DataFrame(dict(a=[np.nan, 2, np.nan]))
    df = fill_scouts_with_zeros(df, dict(a=1.0))
    assert ~df.a.isna().any()


def test_fill_empty_slugs():
    data = dict(apelido=["Cristiano Ronaldo", "Messi"], slug=["CR7", np.nan])
    df = pd.DataFrame(data)
    df = fill_empty_slugs(df)
    assert "slug" in df.columns
    assert ~df.slug.isna().any()
    assert df.slug.values[-1] == "messi"


def test_fill_empty_slug_with_no_slug_col():
    df = pd.DataFrame(dict(apelido=["Cristiano Ronaldo", "Messi"]))
    df = fill_empty_slugs(df)
    assert "slug" in df.columns
    assert ~df.slug.isna().any()


def test_map_status_id_to_string():
    dict_map = {2: "Dúvida", 3: "Suspenso"}
    df = pd.DataFrame(dict(status=[2, 3]))
    df = map_status_id_to_string(df, dict_map)
    assert df.status.isin(list(dict_map.values())).all()


def test_map_status_id_to_string_with_no_status_col():
    df = pd.DataFrame()
    df_res = map_status_id_to_string(df, {})
    assert df.equals(df_res)


def test_map_posicao_to_string():
    dict_map = {"1": "gol", "2": "lat"}
    df = pd.DataFrame(dict(posicao=[1, 2]))
    df = map_posicao_to_string(df, dict_map)
    assert df.posicao.isin(list(dict_map.values())).all()


def test_add_year_columns():
    df = pd.DataFrame()
    df = add_year_column(df, year=2000)
    assert "ano" in df.columns
    assert np.all(df.ano.values == 2000)


def test_fix_accumulated_scouts():
    dict_data = dict(
        ano=[2015, 2015, 2015],
        id_atleta=[1, 1, 1],
        rodada=[1, 2, 3],
        SG=[0, 0, 0],
        CA=[0, 1, 1],
        DE=[0, 1, 3],
    )
    df = pd.DataFrame(dict_data)
    df = fix_accumulated_scouts(df, dict(SG=1, CA=1, DE=1))
    assert np.all(df.SG.values == [0, 0, 0])
    assert np.all(df.CA.values == [0, 1, 0])
    assert np.all(df.DE.values == [0, 1, 2])


def test_fix_accumulated_scouts_in_year_without_accumulation():
    df = pd.DataFrame(dict(ano=[2000]))
    df_res = fix_accumulated_scouts(df, {})
    assert df.equals(df_res)
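
Review note: test_fix_accumulated_scouts expects season-to-date scout totals to be turned back into per-round values, so CA=[0, 1, 1] becomes [0, 1, 0] and DE=[0, 1, 3] becomes [0, 1, 2]. The sketch below is one possible way to express that behavior for readers of this diff. It is not the code in nodes.py: the helper name disaccumulate_scouts and the groupby/diff approach are assumptions made for illustration, and only the column names (ano, id_atleta, rodada, SG, CA, DE) come from the test.

```python
from typing import List

import pandas as pd


def disaccumulate_scouts(df: pd.DataFrame, cols_scouts: List[str]) -> pd.DataFrame:
    """Hypothetical helper: convert accumulated scout totals into per-round values."""
    # Order rows so that each athlete's rounds are consecutive and ascending.
    df_result = df.sort_values(["ano", "id_atleta", "rodada"]).reset_index(drop=True)
    grouped = df_result.groupby(["ano", "id_atleta"])[cols_scouts]
    # diff() yields NaN on each athlete's first round; there the accumulated
    # value already equals the per-round value, so fall back to it.
    per_round = grouped.diff().fillna(df_result[cols_scouts])
    # diff() promotes integers to float, so restore the original dtypes.
    df_result[cols_scouts] = per_round.astype(df_result[cols_scouts].dtypes.to_dict())
    return df_result


# Same data as test_fix_accumulated_scouts.
df_example = pd.DataFrame(
    dict(
        ano=[2015, 2015, 2015],
        id_atleta=[1, 1, 1],
        rodada=[1, 2, 3],
        SG=[0, 0, 0],
        CA=[0, 1, 1],
        DE=[0, 1, 3],
    )
)
df_fixed = disaccumulate_scouts(df_example, ["SG", "CA", "DE"])
print(df_fixed[["rodada", "SG", "CA", "DE"]])
```

The one-line change in nodes.py matters for exactly this case: before the fix the function built df_result but returned the untouched df, so a test like the one above would have seen the accumulated values unchanged.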