Some hints on Dataproc

ROBIN DONG 2021-09-03 11:54

1. When I ran a job on a Dataproc cluster, it reported:
java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: Failed to find data source: BIGQUERY.

The reason is that I hadn’t added the Jar file for BigQuery. After adding the Jar file to the properties section of the template for creating a cluster:

          spark:spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.18.1.jar

the job starts to read data from BigQuery tables.

Remember not to use gs://spark-lib/bigquery/spark-bigquery-latest.jar, because it will hang your job when you are reading BigQuery tables. It seems even Google makes significant mistakes in its cloud platform :p
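
Once the Jar is on the cluster, reading a table from a PySpark job could look roughly like the sketch below. The helper name read_bigquery_table and the table name in the comments are placeholders of mine, not taken from the original job; "bigquery" as the format and "table" as the option are what the spark-bigquery connector expects.

```python
def read_bigquery_table(spark, table):
    """Load a BigQuery table as a DataFrame via the spark-bigquery connector.

    `spark` is an existing SparkSession; `table` is a fully qualified name
    such as "project.dataset.table" (placeholder, adjust to your own table).
    """
    return spark.read.format("bigquery").option("table", table).load()


# Hypothetical usage inside a job submitted to the cluster:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("bq-demo").getOrCreate()
#   df = read_bigquery_table(spark, "my-project.my_dataset.my_table")
#   df.printSchema()
```

If the connector Jar is missing, the load() call is where the "Failed to find data source" error above would be expected to surface.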

2. If a PySpark job needs to use some additional packages in the Dataproc cluster, what should we do?

We still need to add more items to the template so that it installs the pip packages:

    clusterName: robin
          enable-cloud-sql-proxy-on-workers: 'false'
          use-cloud-sql-private-ip: 'false'
          PIP_PACKAGES: 'google-cloud-storage google-api-python-client google-auth'
      - executableFile: gs://goog-dataproc-initialization-actions-us-central1/python/
        executionTimeout: 600s
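
To confirm that the initialization action really installed the packages, a job can check for them before doing any work. This is only a minimal sketch: the function name missing_modules is mine, and the import names in REQUIRED_MODULES are my assumed mapping from the pip package names listed in PIP_PACKAGES above.

```python
import importlib.util

# Assumed import names for the pip packages in PIP_PACKAGES:
#   google-cloud-storage     -> google.cloud.storage
#   google-api-python-client -> googleapiclient
#   google-auth              -> google.auth
REQUIRED_MODULES = ["google.cloud.storage", "googleapiclient", "google.auth"]


def missing_modules(modules):
    """Return the subset of `modules` that cannot be found for import."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # A missing parent package (e.g. no "google" at all) raises here.
            found = False
        if not found:
            missing.append(name)
    return missing


# Hypothetical usage at the top of a PySpark job:
#   problems = missing_modules(REQUIRED_MODULES)
#   if problems:
#       raise RuntimeError(f"pip packages not installed: {problems}")
```

Failing early like this makes a broken initialization action easier to spot than an ImportError halfway through the job.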

3. To see how a Hive table was created:

show create table <table>;
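
The same statement can also be run from PySpark. The sketch below assumes a SparkSession that can see the Hive metastore; the helper name show_create_table and the table name in the comments are placeholders of mine. Depending on the Spark version and the table's storage format, Spark may reject the statement for some Hive tables, so treat this as a starting point.

```python
def show_create_table(spark, table):
    """Return the DDL string that SHOW CREATE TABLE reports for `table`."""
    row = spark.sql(f"SHOW CREATE TABLE {table}").first()
    # The result has a single row with a single column holding the statement.
    return row[0]


# Hypothetical usage:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   print(show_create_table(spark, "my_db.my_table"))
```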

