# Databricks Readme

To use this connector with Databricks, there are a few things you need to do.

First and most important: the Google BigQuery client expects an environment variable pointing to the location of the JSON key file, so ensure all new Databricks clusters have the file in place.

You can add the file via an init script saved inside the /databricks/init/ folder.
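
As a minimal sketch, such an init script might copy the key file from DBFS onto the cluster's local disk (the DBFS source path and script name below are assumptions, not something this repo prescribes):

```sh
#!/bin/bash
# Hypothetical init script, e.g. saved as /databricks/init/copy-bq-credentials.sh.
# Copies the service-account key to the local path that GOOGLE_APPLICATION_CREDENTIALS
# will point at; adjust both paths to match your own key file.
cp /dbfs/FileStore/keys/Justeat-platform-events-c750aee2059a.json /databricks/
```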

Once the key file is in place, use the Databricks API to launch a cluster with the environment variable set. You can use a curl command like this:

```sh
%sh curl -H "Content-Type: application/json" -X POST \
  -u username:password \
  url-of-databricks-instance:443/api/2.0/clusters/create \
  -d '{
    "cluster_name": "spark 161-hadoop-1",
    "spark_version": "1.6.1-ubuntu15.10-hadoop1",
    "num_workers": 1,
    "spark_conf": { "spark.speculation": true },
    "spark_env_vars": { "GOOGLE_APPLICATION_CREDENTIALS": "/databricks/Justeat-platform-events-c750aee2059a.json" }
  }'
```
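
Once the cluster is up, you can sanity-check from a notebook that the variable and file are actually in place (a quick sketch using the %sh magic, which runs on the driver):

```sh
%sh
# Should print /databricks/Justeat-platform-events-c750aee2059a.json
echo $GOOGLE_APPLICATION_CREDENTIALS
# ...and the key file itself should exist at that path
ls -l /databricks/*.json
```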

Once that is done, you need to shade your dependencies: Databricks ships with Avro 1.4 while this connector uses spark-avro 1.8, and there are also Guava dependency conflicts that need resolving.

You can use the Maven Shade plugin or the sbt-assembly plugin.

Below is an example build.sbt file:

```scala
name := "SampleInDatabricks"

version := "1.0"
scalaVersion := "2.11.8"
assemblyJarName in assembly := "uber-SampleInDatabricks-1.0-SNAPSHOT.jar"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided",
  "joda-time" % "joda-time" % "2.9.6"
)

// META-INF discarding
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) => {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}}
```

If you use sbt, ensure the connector jar is in the lib folder; sbt-assembly will package everything in the lib folder into the uber jar, as in the sketch below.
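
A sketch of that workflow (the connector jar path and file name here are illustrative):

```sh
# sbt treats jars in lib/ as unmanaged dependencies, and sbt-assembly
# folds them into the uber jar alongside your own classes.
mkdir -p lib
cp /path/to/spark-bigquery-connector.jar lib/   # illustrative path and name
sbt assembly
# With the build.sbt above, the jar lands under target/scala-2.11/
ls target/scala-2.11/uber-SampleInDatabricks-1.0-SNAPSHOT.jar
```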

Or, if Maven is your preference, below is an example Maven Shade plugin section for the pom:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <minimizeJar>true</minimizeJar>
    <shadedArtifactAttached>true</shadedArtifactAttached>
    <shadedClassifierName>fat</shadedClassifierName>
    <relocations>
      <relocation>
        <pattern>com.google</pattern>
        <shadedPattern>shaded.guava</shadedPattern>
        <includes>
          <include>com.google.**</include>
        </includes>
        <excludes>
          <exclude>com.google.common.base.Optional</exclude>
          <exclude>com.google.common.base.Absent</exclude>
          <exclude>com.google.common.base.Present</exclude>
        </excludes>
      </relocation>
    </relocations>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
    <finalName>uber-${project.artifactId}-${project.version}</finalName>
  </configuration>
</plugin>
```
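
With that plugin configured, a standard package build produces the shaded jar (the exact file name under target/ depends on how the shade plugin combines the finalName above with the fat classifier, so check the output directory):

```sh
mvn package
# Inspect target/ for the shaded jar, named after the <finalName> above
ls target/
```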