Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"check" does not consider extras and allow environments not possible to re-create with new resolver #9299

Closed
potiuk opened this issue Dec 15, 2020 · 15 comments
Labels
C: check Checking dependency graph for consistency type: bug A confirmed bug or unintended behavior

Comments

@potiuk
Copy link
Contributor

potiuk commented Dec 15, 2020

I tried installing airflow 2.0.0rc3 with the new master version of PIP (to be 20.3.2).

When I run the same installation using 20.2.4 the whole install takes about 20 seconds when I already pre-installed some of the prerequisites and it produces consistent installation without any conflicts:

Successfully installed apache-airflow-2.0.0rc3 apache-airflow-providers-amazon-1.0.0 apache-airflow-providers-apache-cassandra-1.0.0 apache-airflow-providers-apache-druid-1.0.0 apache-airflow-providers-apache-hdfs-1.0.0 apache-airflow-providers-apache-hive-1.0.0 apache-airflow-providers-apache-kylin-1.0.0 apache-airflow-providers-apache-livy-1.0.0 apache-airflow-providers-apache-pig-1.0.0 apache-airflow-providers-apache-pinot-1.0.0 apache-airflow-providers-apache-spark-1.0.0 apache-airflow-providers-apache-sqoop-1.0.0 apache-airflow-providers-celery-1.0.0 apache-airflow-providers-cloudant-1.0.0 apache-airflow-providers-cncf-kubernetes-1.0.0 apache-airflow-providers-databricks-1.0.0 apache-airflow-providers-datadog-1.0.0 apache-airflow-providers-dingding-1.0.0 apache-airflow-providers-discord-1.0.0 apache-airflow-providers-docker-1.0.0 apache-airflow-providers-elasticsearch-1.0.0 apache-airflow-providers-exasol-1.0.0 apache-airflow-providers-facebook-1.0.0 apache-airflow-providers-ftp-1.0.0 apache-airflow-providers-google-1.0.0 apache-airflow-providers-grpc-1.0.0 apache-airflow-providers-hashicorp-1.0.0 apache-airflow-providers-http-1.0.0 apache-airflow-providers-imap-1.0.0 apache-airflow-providers-jdbc-1.0.0 apache-airflow-providers-jenkins-1.0.0 apache-airflow-providers-jira-1.0.0 apache-airflow-providers-microsoft-azure-1.0.0 apache-airflow-providers-microsoft-mssql-1.0.0 apache-airflow-providers-microsoft-winrm-1.0.0 apache-airflow-providers-mongo-1.0.0 apache-airflow-providers-mysql-1.0.0 apache-airflow-providers-odbc-1.0.0 apache-airflow-providers-openfaas-1.0.0 apache-airflow-providers-opsgenie-1.0.0 apache-airflow-providers-oracle-1.0.0 apache-airflow-providers-pagerduty-1.0.0 apache-airflow-providers-papermill-1.0.0 apache-airflow-providers-plexus-1.0.0 apache-airflow-providers-postgres-1.0.0 apache-airflow-providers-presto-1.0.0 apache-airflow-providers-qubole-1.0.0 apache-airflow-providers-redis-1.0.0 apache-airflow-providers-salesforce-1.0.0 apache-airflow-providers-samba-1.0.0 apache-airflow-providers-segment-1.0.0 apache-airflow-providers-sendgrid-1.0.0 apache-airflow-providers-sftp-1.0.0 apache-airflow-providers-singularity-1.0.0 apache-airflow-providers-slack-1.0.0 apache-airflow-providers-snowflake-1.0.0 apache-airflow-providers-sqlite-1.0.0 apache-airflow-providers-ssh-1.0.0 apache-airflow-providers-telegram-1.0.0 apache-airflow-providers-vertica-1.0.0 apache-airflow-providers-yandex-1.0.0 apache-airflow-providers-zendesk-1.0.0
WARNING: You are using pip version 20.2.4; however, version 20.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
root@5fea5c1105cb:/opt/airflow# pip check
No broken requirements found.
root@5fea5c1105cb:/opt/airflow# 

With PIP 20.3.2@master it takes about 30 minutes and fails producing "Cannot install" error:

ERROR: Cannot install google-cloud-bigquery[bqstorage,pandas]==2.4.0, nteract-scrapbook==0.4.1, nteract-scrapbook[all]==0.4.1 and papermill[all]==2.2.2 because these package versions have conflicting dependencies.

The conflict is caused by:
    nteract-scrapbook[all] 0.4.1 depends on pyarrow
    nteract-scrapbook 0.4.1 depends on pyarrow
    papermill[all] 2.2.2 depends on pyarrow; extra == "all"
    google-cloud-bigquery[bqstorage,pandas] 2.4.0 depends on pyarrow<3.0dev and >=1.0.0; extra == "bqstorage"

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

The whole installation log is attached.
pip-20-3-2-master.txt

Installation environment:

Latest official Python 3.6 image with airflow dependencies (based on debian buster):

docker pull apache/airflow:master-python3.6-ci

Installation method:

pip install apache-airflow[devel_ci]==2.0.0rc3 --constraint https://raw.githubusercontent.com/apache/airflow/constraints-master/constraints-3.6.txt

The easiest way to reproduce:

git clone https://github.com/apache/airflow.git
./breeze --install-airflow-version 2.0.0rc3 --skip-mounting-local-sources
# While in the container

# uninstall airflow and all 60 providers
pip uninstall apache-airflow -y
pip freeze  |grep airflow-providers |xargs pip uninstall -y

# Install Pip from master
# NOTE! Skip this step if you want to compare 2.20.4 - this is the default version we have in the image)
pip install git+https://github.com/pypa/pip@master

# install airflow in recommended way
pip install apache-airflow[devel_ci]==2.0.0rc3 --constraint https://raw.githubusercontent.com/apache/airflow/constraints-master/constraints-3.6.txt

Related issues: #9298 #9297

@pradyunsg
Copy link
Member

What version of pyarrow did you expect to have installed?

@uranusjr
Copy link
Member

According to the constraints file, pyarrow==0.17.1.

@potiuk
Copy link
Contributor Author

potiuk commented Dec 15, 2020

Yep. With PIP 20.2.4:

pip freeze  | grep pyarrow
pyarrow==0.17.1

And just to remind how it works - those constraints are fully automatically generated using pip freeze (using pip 20.2.4). I do not maintain them manually.

So this just pip 20.2.4 telling pip 20.3 to install 0.17.1 in this case based on all the requirements of dependent packages. This is nothing we added from Airflow side.

So it's either PIP 20.2.4 that is wrong, or PIP 20.3.

@potiuk
Copy link
Contributor Author

potiuk commented Dec 15, 2020

Just to explain the way how we generate those constraints (we only use standard PIP tooling from 20.2.4):

  1. We run pip (20.2.4) eager upgrade
pip install .[devel_ci] --upgrade --upgrade-strategy eager
  1. We run pip check:
pip check
  1. We run 4000 + unit tests

  2. When all the above passes we run the below and commit changes to the constraints file:

pip freeze | sort | \
    grep -v "apache_airflow" | \
    grep -v "@" | \
    grep -v "/opt/airflow" >"${CURRENT_CONSTRAINT_FILE}"

@uranusjr
Copy link
Member

So I think the questions here for us are

  1. Does pyarrow==0.17.1 actually work in this dependency graph?
  2. If it does, why did the new resolver fail to find a solution?
  3. If it does not, why did pip check not catch the conflict?

@potiuk
Copy link
Contributor Author

potiuk commented Dec 15, 2020

I think we gave you one of the most complex cases to look at ;)

And with breeze (the dev environment I introduced to Airflow) it is super easy to introduce - it takes literally few minutes to clone repo/download the image and everything runs in controlled environment inside.

@potiuk
Copy link
Contributor Author

potiuk commented Dec 15, 2020

The problem seems to be with transitive extras in pandas-bgq. It requires google-cloud-bigquery[bqstorage,pandas] and it's the google-cloud-bigquery[pandas] that has pyarrow >1.0.0 dep

It seems that PIP 20.2.4 lost the transitive extras in google-cloud-bigquery

But this is a pure guess.

@uranusjr uranusjr added the type: bug A confirmed bug or unintended behavior label Dec 15, 2020
@uranusjr
Copy link
Member

uranusjr commented Dec 15, 2020

Marking this as a bug because there is likely a bug somewhere. The bug is not necessarily in the new resolver, but the dependency resolution tag makes this easier to find.

@uranusjr uranusjr added the C: dependency resolution About choosing which dependencies to install label Dec 15, 2020
@potiuk
Copy link
Contributor Author

potiuk commented Dec 15, 2020

I have some GOOD NEWS this time.

I did some experiments and manually upgraded the pyarrow version to 2.0.0 (despite eager upgrade in 20.2.4 not upgrading it). and ..... suddenly PIP@Master seems to work!

root@5fea5c1105cb:/opt/airflow# pip check
No broken requirements found.
root@5fea5c1105cb:/opt/airflow# 

I do not see any any backtracking and it seems to be reasonably fast. Just repeated a fresh install (I uninstalled all installed packages before) - and suddenly it works. It seems with pyarrow 2.0.0 also PIP 20.2.4 works (even if with eager update strategy it did not upgrade pyarrow before). So indeed it looks like a bug with pip 20.2.4 that generated bad constraints and caused the new PIP to go haywire.

I will manually update all the constraints now.

@uranusjr
Copy link
Member

I investigated into 3. pip check did not catch the conflict because pyarrow <3.0dev,>=1.0.0 is only specified by google-cloud-bigquery in extras (three exras pandas, bqstorage, and all contain this), and pip check does not follow extras (because pip does not record them in the first place). So I’d guess the new resolver is correct, and there really isn’t a solution with pyarrow==0.17.1.

How we could improve pip check though, is a whole big topic…

@uranusjr uranusjr added C: check Checking dependency graph for consistency and removed C: dependency resolution About choosing which dependencies to install labels Dec 15, 2020
@uranusjr uranusjr changed the title PIP install 20.3.2@master fails to install airflow 2.0.0rc3 "check" does not consider extras and allow environments not possible to re-create with new resolver Dec 15, 2020
@potiuk
Copy link
Contributor Author

potiuk commented Dec 15, 2020

I manually upgraded all our constraints for master/2-0/2.0.0rc3 and we will get it also for 2.0. So it looks like we have a working solution before our release - day after tomorrow voting ends - as long PIP 20.3.2 gets released. Still for 1.10.14 it's not possible (we have pyarrow < 1.0.0 in one of the deps there - but this is not as big of an issue as with 2.0 instlall.

While this is a little brittle (if one bad requirement can cause such a difference in behaviour), with our automated checks and pushing the full set of constraints when everything installs and tests pass, as soon as 20.3.2 gets released we can upgrade to it in our CI pipeline and keep it all under control. Once we run it for a while we can update our release docs and remove the 'downgrade to 20.2.4" warning then (at least for 2.0).

Do you know when the 20.3.2 release can be expected @uranusjr ?

@pradyunsg
Copy link
Member

20.3.3, later today. :)

@potiuk
Copy link
Contributor Author

potiuk commented Dec 15, 2020

After manual updating of the PyArrow and testing PIP 20.3.3 it seems that Airflow installs now for 2.0.0rc3 with even complex combinations of extras. This clears up the path of releasing 2.0.0 - hopefully the day after tomorrow - without worrying about our user's experience.

We still have a problematic extra (papermilll) that cannot be installed wit the new PyPI using the recommended method in 1.10.14 (current stable version) - but this is rarely used dependency and since we have a workaround (legacy resolver or PIP downgrading) this is not a high priority. We will likely release 1.10.15 in the coming weeks likely, and we will try to fix it then.

We lowered priority/description of the issue apache/airflow#12838 and we will close it once our users confirm that all works for them after we release 2.0.0. We will also add 20.3.+ to our CI pipeline in the coming days so that we can test it all automatically also for future upgrades to PIP.

Thanks for the fast reaction and resolution!

@potiuk
Copy link
Contributor Author

potiuk commented Dec 15, 2020

Feel free to close the issue.

@uranusjr
Copy link
Member

Closing this since the check issue is tracked in #4086.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 5, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
C: check Checking dependency graph for consistency type: bug A confirmed bug or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants