A Postgres-based k8s-infra data pipeline that produces BigQuery tables for reviewing Kubernetes Public artifact traffic in DataStudio.
It utilises Postgres for some local querying, along with the Postgres container image's init folder, in which several shell scripts are sourced and SQL queries are run.
The process is split into multiple steps; the steps can be found in the `./pg-init.d` folder in this repo, or in `/docker-entrypoint-initdb.d` inside the container.
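The official Postgres image runs whatever it finds in `/docker-entrypoint-initdb.d` on first start: `*.sql` files are fed to psql and `*.sh` files are sourced or executed. As a minimal sketch of the shape of one numbered step (the file name and SQL path here are hypothetical, not this repo's actual scripts):

```shell
#!/usr/bin/env bash
# pg-init.d/20-collate-asn-data.sh (hypothetical name): one init step run by
# the Postgres entrypoint. It executes a local SQL file against Postgres.
set -o errexit -o nounset -o pipefail

# Run an SQL file from /app against the local database (path is illustrative).
psql --username postgres --dbname postgres --file /app/collate-asn-data.sql
```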
Some scripts call files from the `/app` folder; in that folder there are:

- some SQL to run against BigQuery
- some Python scripts for local data management
- a text file containing the names of buckets that contain access logs (with `_usage` emitted)
The output is finally loaded into a DataStudio report and reviewed by members of sig-k8s-infra, without PII displayed.
- Kubernetes Public GCS buckets for artifacts are configured to write public access logs to a GCS bucket called `k8s-infra-artifacts-gcslogs` in the `k8s-infra-public-pii` project
- Data from ASN aggregators (PyASN, Potaroo) is collated in Postgres for matching ASN owners / companies to IP ranges
- The ASN data is then loaded into BigQuery
- Publicly known company vendor ASNs from the k8s.io repo are loaded into BigQuery
- PeeringDB company information is fetched via their API and transformed in Postgres with the Potaroo data before being uploaded to BigQuery
- The Kubernetes Public GCS logs are then loaded into the `usage_all_raw` table of the BigQuery dataset
- Various tables are formed out of the previously loaded data in BigQuery, such as IPs and companies, to join things together better
- The IPs are joined and expanded with the IP ranges from the ASN data, then loaded into the dataset as a table
- All the data is then linked into a single table
- Tables are then promoted to the stable `${GCP_BIGQUERY_DATASET}` dataset for usage in DataStudio (promotion is sketched below the environment-variable table)
- The pipeline waits until PID 1's cmdline is `postgres` (meaning init has finished), then halts Postgres cleanly with `pg_ctl kill quit ${PID_FOR_POSTGRES}`, exiting 0 (a sketch of this step follows the list)
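A minimal sketch of that final wait-and-halt step, assuming the check reads PID 1's command line from procfs (the real script's loop may differ):

```shell
# Wait until PID 1's cmdline is exactly "postgres", i.e. the entrypoint has
# finished running the init scripts and re-exec'd the Postgres server.
until [ "$(tr -d '\0' < /proc/1/cmdline)" = "postgres" ]; do
  sleep 1
done

# Signal the server to stop via pg_ctl, then exit successfully.
PID_FOR_POSTGRES=1
pg_ctl kill QUIT "${PID_FOR_POSTGRES}"
exit 0
```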
Name | Default | Description |
---|---|---|
`TZ` | `` | Container time zone |
`POSTGRES_PASSWORD` | `postgres` | The password to set for Postgres |
`GOOGLE_APPLICATION_CREDENTIALS` | `` | The path to the GCP service account JSON key |
`GCP_PROJECT` | `k8s-infra-ii-sandbox` | The project to target, which hosts `${GCP_BIGQUERY_DATASET}` and will also be billed |
`GCP_BIGQUERY_DATASET` | `etl_script_generated_set` | The dataset and basename to write to (the date is appended) |
`NO_PROMOTE` | `` | Disable the promotion of `${GCP_BIGQUERY_DATASET}_${DATE}` to `${GCP_BIGQUERY_DATASET}` (sketched below) |
`ASN_DATA_PIPELINE_RETAIN` | `` | Keeps Postgres running after the job has completed |
`GCP_BQ_DUMP_BUCKET` | `` | A GCS bucket to dump content from BigQuery |
`DEBUG_MODE` | `` | Toggles bash's debug mode |
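For context on `NO_PROMOTE`: each run writes into a dated dataset, `${GCP_BIGQUERY_DATASET}_${DATE}`, and promotion copies the results over the stable dataset. A hedged sketch of what promotion amounts to, using `bq cp` (the scripts' actual invocation may differ, and `usage_all_raw` is just one example table):

```shell
DATE="$(date -u '+%Y%m%d')"

# Copy a dated table over its counterpart in the stable dataset; -f overwrites
# the destination without prompting.
bq cp -f \
  "${GCP_PROJECT}:${GCP_BIGQUERY_DATASET}_${DATE}.usage_all_raw" \
  "${GCP_PROJECT}:${GCP_BIGQUERY_DATASET}.usage_all_raw"
```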
Log into gcloud

```shell
gcloud auth login
```

Set the GCP project

```shell
gcloud config set project k8s-infra-ii-sandbox
```

Log into application-default

```shell
gcloud auth application-default login
```
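Optionally, confirm the application-default credentials are usable before starting a run (this check is not part of the pipeline itself):

```shell
gcloud auth application-default print-access-token >/dev/null && echo "credentials OK"
```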
Run in Docker

```shell
TMP_DIR_ETL=$HOME/.tmp/public-log-asn-matcher-$RANDOM
sudo mkdir -p $TMP_DIR_ETL
sudo chmod 0777 ${TMP_DIR_ETL}
sudo chown -R 999 ~/.config/gcloud # allow access for the postgres user (uid 999)
docker run \
  -d \
  -e DEBUG_MODE=true \
  -e TZ=$TZ \
  -e POSTGRES_PASSWORD="postgres" \
  -e GCP_PROJECT=k8s-infra-ii-sandbox \
  -e GCP_BIGQUERY_DATASET=etl_script_generated_set \
  -e GCP_BQ_DUMP_BUCKET=ii_bq_scratch_dump \
  -v $HOME/.config/gcloud:/var/lib/postgresql/.config/gcloud \
  -v "${TMP_DIR_ETL}:/tmp" \
  --name public-log-asn-matcher \
  gcr.io/k8s-staging-infra-tools/public-log-asn-matcher
docker logs -f public-log-asn-matcher
```

Change the permissions of ~/.config/gcloud back

```shell
sudo chown -R $(id -u) ~/.config/gcloud
```
A list of buckets is used for their `.*_usage` buckets, which store the public access logs

```shell
./generate-buckets.sh
```
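The generated file is a plain list of bucket names, one per line; the access-log objects GCS emits for each listed bucket carry a `_usage` component in their names. Illustrative entries only (not the real list):

```
k8s-artifacts-prod
k8s-release-dev
```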
Build and push the image with Cloud Build

```shell
TAG="$(date -u '+%Y%m%d')-$(git describe --tags --always --dirty)"
PROJECT="k8s-infra-ii-sandbox"
BUCKET="ii_bq_scratch_dump"
gcloud builds submit \
  --verbosity info \
  --config cloudbuild.yaml \
  --substitutions _TAG="${TAG}",_GIT_TAG="${TAG}" \
  --project ${PROJECT} \
  --gcs-source-staging-dir "gs://${BUCKET}/source" \
  .
```
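Assuming `cloudbuild.yaml` tags the pushed image with `_TAG` (an assumption about this repo's build config), the freshly built image can then be pulled by that tag:

```shell
docker pull "gcr.io/k8s-staging-infra-tools/public-log-asn-matcher:${TAG}"
```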