# Public log asn matcher A Postgres-based k8s-infra data pipeline that produces BigQuery tables for reviewing Kubernetes Public artifact traffic in DataStudio. It utilises Postgres for some local querying and the Postgres container image's init folder, where several shell scripts are sourced and SQL queries are run. The process is split between multiple steps, the steps can be found in the `./pg-init.d` folder in this repo or the `/docker-entrypoint-initdb.d`. Some scripts call from the `/app` folder, in the folder there are: - some SQL to run against BigQuery - some Python scripts for local data management - a text file containing the names of buckets that container access logs (with `_usage` emitted) The output is finally loaded in a DataStudio report and reviewed by members [sig-k8s-infra](https://github.com/kubernetes/community/blob/master/sig-k8s-infra/README.md) without PII displayed. ## Flow 1. Kubernetes Public GCS buckets for artifacts are configured to write public access logs to a GCS bucket called `k8s-infra-artifacts-gcslogs` in the `k8s-infra-public-pii` project 2. Data from ASN aggregators (PyASN, Potaroo) is collated in Postgres for matching ASN owners / Companies to IP ranges 3. The ASN data is then loaded into BigQuery 4. Publicly known company vendor ASNs from the k8s.io repo into BigQuery 5. PeeringDB company information is fetched via their API and transformed into Postgres with the Potaroo data before being uploaded to BigQuery 6. The Kubernetes Public GCS logs are then loaded into the _usage_all_raw_ table of the BigQuery dataset 7. Various tables are formed out of the previously loaded data in BigQuery, such as IPs and companies to join things together better 8. The IPs are joined and expanded with the IP ranges from the ASN data, then loaded into the dataset as a table 9. All the data is then linked into a single table 10. Tables are then promoted to the stable _${GCP_BIGQUERY_DATASET}_ table for usage in DataStudio 11. Wait until PID1's cmdline is `postgres` (meaning finished init) then halting Postgres cleanly with `pg_ctl kill quit ${PID_FOR_POSTGRES}`, exiting 0 ## Environment variables | Name | Default | Description | | -------------------------------- | -------------------------- | ------------------------------------------------------------------------------------- | | `TZ` | `` | Container time zone | | `POSTGRES_PASSWORD` | `postgres` | The password to set for Postgres | | `GOOGLE_APPLICATION_CREDENTIALS` | `` | The path to the GCP service account json key | | `GCP_PROJECT` | `k8s-infra-ii-sandbox` | The project to target which hosts `${GCP_BIGQUERY_DATASET}` and also will be billed | | `GCP_BIGQUERY_DATASET` | `etl_script_generated_set` | The dataset and basename to write to (appends date) | | `NO_PROMOTE` | `` | Disable the promotion of `${GCP_BIGQUERY_DATASET}_${DATE}` to ${GCP_BIGQUERY_DATASET} | | `ASN_DATA_PIPELINE_RETAIN` | `` | Keeps Postgres running after the job has completed | | `GCP_BQ_DUMP_BUCKET` | `` | A GCP bucket to dump content from BigQuery | | `DEBUG_MODE` | `` | Toggles bash's debug mode | ## Prepare Log into gcloud ```bash gcloud auth login ``` Set the GCP project ```bash gcloud config set project k8s-infra-ii-sandbox ``` Log into application-default ```bash gcloud auth application-default login ``` ## Running the Pipeline Manually Run in Docker ```bash TMP_DIR_ETL=$HOME/.tmp/public-log-asn-matcher-$RANDOM sudo mkdir -p $TMP_DIR_ETL sudo chmod 0777 ${TMP_DIR_ETL} sudo chown -R 999 ~/.config/gcloud # allow for postgres user docker run \ -d \ -e DEBUG_MODE=true \ -e TZ=$TZ \ -e POSTGRES_PASSWORD="postgres" \ -e GCP_PROJECT=k8s-infra-ii-sandbox \ -e GCP_BIGQUERY_DATASET=etl_script_generated_set \ -e GCP_BQ_DUMP_BUCKET=ii_bq_scratch_dump \ -v $HOME/.config/gcloud:/var/lib/postgresql/.config/gcloud \ -v "${TMP_DIR_ETL}:/tmp" \ --name public-log-asn-matcher \ gcr.io/k8s-staging-infra-tools/public-log-asn-matcher docker logs -f public-log-asn-matcher ``` ### Clean up Change permissions of ~/.config/gcloud back ```bash sudo chown -R $(id -u) ~/.config/gcloud ``` ## Generating the bucket list A list of buckets is used for their `.*_usage` bucket which stores the public access logs ```bash ./generate-buckets.sh ``` ## Run a test build ```bash TAG="$(date -u '+%Y%m%d')-$(git describe --tags --always --dirty)" PROJECT="k8s-infra-ii-sandbox" BUCKET="ii_bq_scratch_dump" gcloud builds submit \ --verbosity info \ --config cloudbuild.yaml \ --substitutions _TAG="${TAG}",_GIT_TAG="${TAG}" \ --project ${PROJECT} \ --gcs-source-staging-dir "gs://${BUCKET}/source" \ . ```