
Spark connector #25


Closed · angelcervera opened this issue Jul 8, 2017 · 15 comments

@angelcervera (Member) commented Jul 8, 2017

It is possible to create a connector for easy access from Spark.
After that, it will not be necessary to put blocks in HDFS; we will be able to work directly with the pbf file.

Related articles:

Other connectors as examples:

@ericsun95 (Contributor) commented:

> It is possible to create a connector for easy access from Spark.
> After that, it will not be necessary to put blocks in HDFS; we will be able to work directly with the pbf file.

Good reference, the MongoDB connector: https://databricks.com/session/how-to-connect-spark-to-your-own-datasource

This is a very cool suggestion. Any updates?

@angelcervera (Member, Author) commented Jul 30, 2020

I'm holding off on that because Spark distributions usually use Scala 2.11.
The latest osm4scala version does not support 2.11: https://github.com/simplexspatial/osm4scala#selecting-the-right-versions
That version only adds Scala 2.13 support and an updated ScalaPB version (which does not support Scala 2.11).

If you think that this is interesting and people will use it, I can work on it. What do you think?

@ericsun95 (Contributor) commented Jul 30, 2020

I think it's good to have. If it can support both 2.11 and 2.13, that would be even better. Currently it's not convenient for us to ingest OSM data through Spark: the databricks-xml package may lose some data when reading big XML files, and spark-osm-datasource and osm-parquetizer have not been maintained for a long time. If we could make use of osm4scala and connect it to Spark, it would be very cool.
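
Cross-building is the usual way to cover several Scala versions at once; a minimal, illustrative build.sbt sketch follows (version numbers are examples only, and this is not osm4scala's actual build definition). As noted above, the blocker is that the updated ScalaPB dependency publishes no Scala 2.11 artifacts, so the 2.11 axis would have to pin an older ScalaPB.

```scala
// build.sbt -- illustrative cross-build setup, not osm4scala's real build.
// Version numbers are examples only.
scalaVersion := "2.12.12"
crossScalaVersions := Seq("2.11.12", "2.12.12", "2.13.3")

// `sbt +compile +test +publishLocal` then builds and publishes one artifact per
// Scala version; the 2.11 axis is the problematic one because of the ScalaPB issue
// described in the comments above.
```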

@ericsun95 (Contributor) commented:

Any updates? Interested in the follow-ups.

@angelcervera (Member, Author) commented:

Hi @ericsun95, I'm sorry, but I was on holiday. Tomorrow I will be able to spend time on this task, and I will ping you back.

angelcervera self-assigned this Sep 6, 2020
@angelcervera (Member, Author) commented:

Hi @ericsun95
Because SQL is not a friend of polymorphism, I'm thinking about the most useful way to represent any type of entity in a Row.
Do you think that this would work in your case?

case class OsmSqlEntity(
  id: Long,
  `type`: OSMTypes.Value,
  latitude: Double,
  longitude: Double,
  nodes: Seq[Long],
  relations: Seq[RelationMemberEntity],
  tags: Map[String, String]
)
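
For illustration, this is roughly the Spark SQL schema that the flattened shape above could translate to; the nullability flags and the encodings of `type` and of the relation member struct are assumptions, not a final design.

```scala
import org.apache.spark.sql.types._

// One row per OSM entity; fields that do not apply to a given entity type stay null/empty.
val osmEntitySchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("type", IntegerType, nullable = false),         // assumed: 0 = node, 1 = way, 2 = relation
  StructField("latitude", DoubleType, nullable = true),       // nodes only
  StructField("longitude", DoubleType, nullable = true),      // nodes only
  StructField("nodes", ArrayType(LongType), nullable = true), // ways only: referenced node ids
  StructField("relations", ArrayType(StructType(Seq(          // relations only: member list (assumed fields)
    StructField("id", LongType, nullable = false),
    StructField("relationType", IntegerType, nullable = false),
    StructField("role", StringType, nullable = true)
  ))), nullable = true),
  StructField("tags", MapType(StringType, StringType), nullable = true)
))
```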

@ericsun95 (Contributor) commented:

> Hi @ericsun95
> Because SQL is not a friend of polymorphism, I'm thinking about the most useful way to represent any type of entity in a Row.
> Do you think that this would work in your case?
>
> case class OsmSqlEntity(
>   id: Long,
>   `type`: OSMTypes.Value,
>   latitude: Double,
>   longitude: Double,
>   nodes: Seq[Long],
>   relations: Seq[RelationMemberEntity],
>   tags: Map[String, String]
> )

I think this is an interesting topic. If you check https://github.com/woltapp/spark-osm-datasource#schema, it uses a similar structure that also includes the common entity fields (though it is odd that it sets visible to false for ways and relations). And for Java-based projects like https://github.com/adrianulbona/osm-parquetizer/tree/master/src/main/java/io/github/adrianulbona/osm/parquet/convertor, the structure is a common abstract class plus the three entity types (node, way, and relation). It seems the current osm4scala doesn't include those common fields (user, uid, visible, changesetId), which may be needed in some cases.

My preference is to add the common fields to your current OsmSqlEntity case class, or to have a reader that reads one specific OSMType with a strongly typed schema. Clients can always rename.
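
As a sketch of that second option, a strongly typed per-entity view could be layered on top of the generic row; the `OsmNode` case class, the table name, and the string encoding of `type` below are illustrative assumptions only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative, node-only projection of the generic OsmSqlEntity row sketched above.
case class OsmNode(id: Long, latitude: Double, longitude: Double, tags: Map[String, String])

val spark = SparkSession.builder().appName("typed-osm-nodes").getOrCreate()
import spark.implicits._

// `osm_entities` stands in for whatever the connector returns; filtering on the string
// "Node" assumes one possible encoding of the `type` column.
val nodes: Dataset[OsmNode] = spark.table("osm_entities")
  .where($"type" === "Node")
  .select($"id", $"latitude", $"longitude", $"tags")
  .as[OsmNode]
```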

One thing I am really interested in here: does a pbf block store any locality between entities (for example, a way/relation having most of its referenced nodes in the same block)? If yes, could we get a partition-wise strategy in Spark after reading it as a DataFrame, so that we can avoid too much shuffle later? If super-relations exist, how could we still partition the data wisely? I think that is very important when using Spark to process OSM data.

Thanks for your time.

@ericsun95 (Contributor) commented:

For the naming of those fields, I personally prefer following https://github.com/openstreetmap/osmosis/tree/master/osmosis-core/src/main/java/org/openstreetmap/osmosis/core/domain/v0_6, as most OSM projects follow the same style. That said, it is up to you; clients can always rename.

angelcervera added the "spark Spark connector" label Sep 7, 2020
@angelcervera (Member, Author) commented Sep 7, 2020

There is no documentation about how to write Spark connectors, so I have been doing a little bit of reverse engineering. We are lucky because all this stuff is usually open source: the Spark built-in connectors, the Cassandra connector, etc.

This task is a little bit complex, so I'm going to split it into different steps:

  1. Sequential parsing (#36): I will implement a sequential reader before thinking about how to chunk the pbf file. It takes 40 minutes to read the full planet, so let's see if the performance is good even in this case. Maybe it is possible to start processing data before the full file has been read, so processing would run in parallel with the parsing. Maybe data transfer will be a bottleneck.
  2. Splitting the file (#37): implement splitting, to take advantage of distributed storage such as HDFS or S3.
  3. Prune fields and filters (#38): I will implement these Spark SQL optimizations to be able to drop unnecessary fields and rows at parsing time (see the usage sketch after this list).
  4. Intelligent partitioning, localization, etc. (#39): let's talk in that task about other improvements, like partitioning per pbf block, per geolocation (I will need it for SimpleXSpatial Server seeding), etc.
  5. Expose info fields (#40): this is not related to Spark, but it will be useful. For my original use case I did not need information fields like changeset, user id, etc., but it is true that other people with different use cases would need them. Also, it is not a huge effort to add.
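
As a usage sketch for the plan above, assuming a short format name of "osm.pbf" and the flattened entity columns discussed earlier (both are illustrative assumptions, not a published contract):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("osm-ingest").getOrCreate()

// Read the pbf file through the connector (format name and path assumed for illustration).
val entities = spark.read
  .format("osm.pbf")
  .load("hdfs:///data/planet-latest.osm.pbf")

// Column pruning + row filtering: with #38 in place these could be pushed down to
// parsing time instead of materialising full rows first.
val namedEntities = entities
  .select("id", "latitude", "longitude", "tags")
  .where("tags['name'] IS NOT NULL")
```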

@ericsun95, in relation to:

> One thing I am really interested in here: does a pbf block store any locality between entities (for example, a way/relation having most of its referenced nodes in the same block)? If yes, could we get a partition-wise strategy in Spark after reading it as a DataFrame, so that we can avoid too much shuffle later? If super-relations exist, how could we still partition the data wisely? I think that is very important when using Spark to process OSM data.

The osm.pbf format does not store this information, but maybe it will be possible to collect this type of metric at parsing time.
It might even be possible to generate an index with this kind of data, to allow splitting the file and doing more intelligent partitioning at the same time. But for the moment, let's keep it simple. I don't have a lot of free time. 😢
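
A rough sketch of what collecting that kind of metric in Spark could look like, assuming (purely hypothetically) that a `blockId` column were exposed per entity, which neither the format nor the planned connector provides today:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("osm-block-locality").getOrCreate()
import spark.implicits._

// `osm_entities_with_block_id` is purely hypothetical: the generic entity rows plus a
// `blockId` column identifying which pbf block each entity came from.
val entities = spark.table("osm_entities_with_block_id")

val nodeBlocks = entities.where($"type" === "Node")
  .select($"id".as("nodeId"), $"blockId".as("nodeBlock"))

val wayRefs = entities.where($"type" === "Way")
  .select($"id".as("wayId"), $"blockId".as("wayBlock"), explode($"nodes").as("nodeId"))

// For each way: what fraction of its referenced nodes live in the same block as the way.
// A high ratio would justify partitioning per pbf block to reduce shuffle.
val locality = wayRefs.join(nodeBlocks, "nodeId")
  .groupBy($"wayId")
  .agg(avg(when($"nodeBlock" === $"wayBlock", 1.0).otherwise(0.0)).as("sameBlockRatio"))
```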

@angelcervera (Member, Author) commented:

Let's comment on every task in its own issue so we can stay focused.

@angelcervera (Member, Author) commented Sep 19, 2020

Hi @ericsun95
I have the first version of the connector ready to release, but there is an issue that I want to share to get feedback.

osm4scala and Spark both use the Google Protobuf library, but two different versions that are not compatible.

The solution is to shade the library in a fat jar.

To do this, there are three options:

  1. Don't shade the library; shade it in the user application instead. The user of the osm4scala connector will need to create a fat jar with their application and shade osm4scala or the Google Protobuf library. I don't like the idea of forcing the user to create a fat jar. In fact, I always try to avoid it, because in the end this practice brings a lot of hidden surprises.
  2. Release the Spark connector as a shaded fat library. It will force me to shade all libraries, to avoid conflicts with others used by the application.
  3. Release two versions of osm4scala: the current one with the latest version of Google Protobuf, and another one with version 2.5.0 (from Mar 2013 😭), which is the one used by Spark. This option would be the simplest for the final user, but it would be a pain to maintain.

Tbh, I don't like any of the options. What's your opinion? From the user's point of view, is it fine for you to build a fat jar, or is it better to use a fat library?
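
For reference, a minimal build.sbt sketch of what option 1 would mean on the application side, assuming sbt-assembly; the plugin and rule API are sbt-assembly's, nothing osm4scala-specific:

```scala
// project/plugins.sbt
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")

// build.sbt -- shade Google Protobuf inside the application fat jar so it cannot clash
// with the protobuf 2.5.0 that ships with Spark.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
)
```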

@ericsun95 (Contributor) commented:

> Hi @ericsun95
> I have the first version of the connector ready to release, but there is an issue that I want to share to get feedback.
>
> osm4scala and Spark both use the Google Protobuf library, but two different versions that are not compatible.
>
> The solution is to shade the library in a fat jar.
>
> To do this, there are three options:
>
> 1. Don't shade the library; shade it in the user application instead. The user of the osm4scala connector will need to create a fat jar with their application and shade osm4scala or the Google Protobuf library. I don't like the idea of forcing the user to create a fat jar. In fact, I always try to avoid it, because in the end this practice brings a lot of hidden surprises.
>
> 2. Release the Spark connector as a shaded fat library. It will force me to shade all libraries, to avoid conflicts with others used by the application.
>
> 3. Release two versions of osm4scala: the current one with the latest version of Google Protobuf, and another one with version 2.5.0 (from Mar 2013 😭), which is the one used by Spark. This option would be the simplest for the final user, but it would be a pain to maintain.
>
> Tbh, I don't like any of the options. What's your opinion? From the user's point of view, is it fine for you to build a fat jar, or is it better to use a fat library?

Just curious: are there any other ways to fix the conflict, like forcing the build to pick the latest version? I prefer option 2. Option 1 seems more complicated for someone not so familiar with osm4scala. Anyway, nice job!
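
On the "force one version" idea: the override itself is trivial to write, as in the hedged build.sbt sketch below, but because the two Protobuf versions are not compatible (as noted above), it usually just moves the breakage to the Spark/Hadoop side, which is why shading is discussed instead.

```scala
// build.sbt -- illustrative only; the version number is an example.
dependencyOverrides += "com.google.protobuf" % "protobuf-java" % "3.12.4"
// Classes compiled against protobuf-java 2.5.0 (Spark/Hadoop) may then fail at runtime,
// since the two versions are not compatible, per the discussion above.
```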

@angelcervera (Member, Author) commented:

Hi @ericsun95
I released #41 with the first connector version.

Enjoy

@ericsun95 (Contributor) commented:

> Hi @ericsun95
> I released #41 with the first connector version.
>
> Enjoy

Cool. Nice job!!

@angelcervera (Member, Author) commented:

Keeping #39 and #40 open, but the Spark connector has been working and used in production for a while, so this ticket is done.
