
Spark connector #25


Closed · angelcervera opened this issue Jul 8, 2017 · 15 comments

@angelcervera (Member) commented Jul 8, 2017

It is possible to create a connector for easy access from Spark.
After that, it will not be necessary to put blocks in HDFS; we will be able to work directly with the pbf file.

Related articles:

Other connectors as examples:

@ericsun95 (Contributor) commented:

> It is possible to create a connector for easy access from Spark.
> After that, it will not be necessary to put blocks in HDFS; we will be able to work directly with the pbf file.

Good reference, the MongoDB connector: https://databricks.com/session/how-to-connect-spark-to-your-own-datasource

This is a very cool suggestion. Any updates?

@angelcervera (Member, Author) commented Jul 30, 2020

I'm holding off on that because Spark distributions usually use Scala 2.11.
The latest osm4scala version does not support 2.11: https://github.com/simplexspatial/osm4scala#selecting-the-right-versions
That version only adds Scala 2.13 support and an updated ScalaPB version (which does not support Scala 2.11).

If you think that this is interesting and people will use it, I can work on it. What do you think?

@ericsun95 (Contributor) commented Jul 30, 2020

I think it's good to have. If it can support both 2.11 and 2.13, that would be even better. Currently it's not convenient for us to ingest OSM data through Spark: the databricks-xml package may lose some data when reading big XML files, and spark-osm-datasource and osm-parquetizer have not been maintained for a long time. If we could make use of osm4scala and connect it to Spark, it would be very cool.
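
Cross-building is the usual way to cover several Scala versions at once; a minimal, illustrative build.sbt sketch follows (version numbers are examples only, and this is not osm4scala's actual build definition). As noted above, the blocker is that the updated ScalaPB dependency publishes no Scala 2.11 artifacts, so the 2.11 axis would have to pin an older ScalaPB.

```scala
// build.sbt -- illustrative cross-build setup, not osm4scala's real build.
// Version numbers are examples only.
scalaVersion := "2.12.12"
crossScalaVersions := Seq("2.11.12", "2.12.12", "2.13.3")

// `sbt +compile +test +publishLocal` then builds and publishes one artifact per
// Scala version; the 2.11 axis is the problematic one because of the ScalaPB issue
// described in the comments above.
```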

@ericsun95 (Contributor) commented:

Any updates? Interested in the follow-ups.

@angelcervera (Member, Author) commented:

Hi @ericsun95, I'm sorry, but I was on holiday. Tomorrow I will be able to spend time on this task, and I will ping you back.

angelcervera self-assigned this Sep 6, 2020
@angelcervera (Member, Author) commented:

Hi @ericsun95
Because SQL is not a friend of polymorphism, I'm thinking about the most useful way to represent any type of entity in a Row.
Do you think that this would work in your case?

case class OsmSqlEntity(
  id: Long,
  `type`: OSMTypes.Value,
  latitude: Double,
  longitude: Double,
  nodes: Seq[Long],
  relations: Seq[RelationMemberEntity],
  tags: Map[String, String]
)
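
For illustration, this is roughly the Spark SQL schema that the flattened shape above could translate to; the nullability flags and the encodings of `type` and of the relation member struct are assumptions, not a final design.

```scala
import org.apache.spark.sql.types._

// One row per OSM entity; fields that do not apply to a given entity type stay null/empty.
val osmEntitySchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("type", IntegerType, nullable = false),         // assumed: 0 = node, 1 = way, 2 = relation
  StructField("latitude", DoubleType, nullable = true),       // nodes only
  StructField("longitude", DoubleType, nullable = true),      // nodes only
  StructField("nodes", ArrayType(LongType), nullable = true), // ways only: referenced node ids
  StructField("relations", ArrayType(StructType(Seq(          // relations only: member list (assumed fields)
    StructField("id", LongType, nullable = false),
    StructField("relationType", IntegerType, nullable = false),
    StructField("role", StringType, nullable = true)
  ))), nullable = true),
  StructField("tags", MapType(StringType, StringType), nullable = true)
))
```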

@ericsun95 (Contributor) commented:

> Hi @ericsun95
> Because SQL is not a friend of polymorphism, I'm thinking about the most useful way to represent any type of entity in a Row.
> Do you think that this would work in your case?
>
> case class OsmSqlEntity(
>   id: Long,
>   `type`: OSMTypes.Value,
>   latitude: Double,
>   longitude: Double,
>   nodes: Seq[Long],
>   relations: Seq[RelationMemberEntity],
>   tags: Map[String, String]
> )

I think this is an interesting topic. If you check https://github.com/woltapp/spark-osm-datasource#schema, it uses a similar structure that also includes the common entity fields (though it is odd that it sets visible to false for ways and relations). And for Java-based projects like https://github.com/adrianulbona/osm-parquetizer/tree/master/src/main/java/io/github/adrianulbona/osm/parquet/convertor, the structure is a common abstract class plus the three entity types (node, way, and relation). It seems the current osm4scala doesn't include those common fields (user, uid, visible, changesetId), which may be needed in some cases.

My preference is to add the common fields to your current OsmSqlEntity case class, or to have a reader that reads one specific OSMType with a strongly typed schema. Clients can always rename.
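
As a sketch of that second option, a strongly typed per-entity view could be layered on top of the generic row; the `OsmNode` case class, the table name, and the string encoding of `type` below are illustrative assumptions only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative, node-only projection of the generic OsmSqlEntity row sketched above.
case class OsmNode(id: Long, latitude: Double, longitude: Double, tags: Map[String, String])

val spark = SparkSession.builder().appName("typed-osm-nodes").getOrCreate()
import spark.implicits._

// `osm_entities` stands in for whatever the connector returns; filtering on the string
// "Node" assumes one possible encoding of the `type` column.
val nodes: Dataset[OsmNode] = spark.table("osm_entities")
  .where($"type" === "Node")
  .select($"id", $"latitude", $"longitude", $"tags")
  .as[OsmNode]
```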

One thing I am really interested in here: does a pbf block store any locality between entities (for example, a way/relation having most of its referenced nodes in the same block)? If yes, could we get a partition-wise strategy in Spark after reading it as a DataFrame, so that we can avoid too much shuffle later? If super-relations exist, how could we still partition the data wisely? I think that is very important when using Spark to process OSM data.

Thanks for your time.

@ericsun95 (Contributor) commented:

For the naming of those fields, I personally prefer following https://github.com/openstreetmap/osmosis/tree/master/osmosis-core/src/main/java/org/openstreetmap/osmosis/core/domain/v0_6, as most OSM projects follow the same style. That said, it is up to you; clients can always rename.

angelcervera added the "spark Spark connector" label Sep 7, 2020
@angelcervera (Member, Author) commented Sep 7, 2020

There is no documentation about how to write Spark connectors, so I have been doing a little bit of reverse engineering. We are lucky because all this stuff is usually open source: the Spark built-in connectors, the Cassandra connector, etc.

This task is a little bit complex, so I'm going to split it into different steps:

  1. Sequential parsing (#36): I will implement a sequential reader before thinking about how to chunk the pbf file. It takes 40 minutes to read the full planet, so let's see if the performance is good even in this case. Maybe it is possible to start processing data before the full file has been read, so processing would run in parallel with the parsing. Maybe data transfer will be a bottleneck.
  2. Splitting the file (#37): implement splitting, to take advantage of distributed storage such as HDFS or S3.
  3. Prune fields and filters (#38): I will implement these Spark SQL optimizations to be able to drop unnecessary fields and rows at parsing time (see the usage sketch after this list).
  4. Intelligent partitioning, localization, etc. (#39): let's talk in that task about other improvements, like partitioning per pbf block, per geolocation (I will need it for SimpleXSpatial Server seeding), etc.
  5. Expose info fields (#40): this is not related to Spark, but it will be useful. For my original use case I did not need information fields like changeset, user id, etc., but it is true that other people with different use cases would need them. Also, it is not a huge effort to add.
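
As a usage sketch for the plan above, assuming a short format name of "osm.pbf" and the flattened entity columns discussed earlier (both are illustrative assumptions, not a published contract):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("osm-ingest").getOrCreate()

// Read the pbf file through the connector (format name and path assumed for illustration).
val entities = spark.read
  .format("osm.pbf")
  .load("hdfs:///data/planet-latest.osm.pbf")

// Column pruning + row filtering: with #38 in place these could be pushed down to
// parsing time instead of materialising full rows first.
val namedEntities = entities
  .select("id", "latitude", "longitude", "tags")
  .where("tags['name'] IS NOT NULL")
```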

@ericsun95, in relation to:

> One thing I am really interested in here: does a pbf block store any locality between entities (for example, a way/relation having most of its referenced nodes in the same block)? If yes, could we get a partition-wise strategy in Spark after reading it as a DataFrame, so that we can avoid too much shuffle later? If super-relations exist, how could we still partition the data wisely? I think that is very important when using Spark to process OSM data.

The osm.pbf format does not store this information, but maybe it will be possible to collect this type of metric at parsing time.
It might even be possible to generate an index with this kind of data, to allow splitting the file and doing more intelligent partitioning at the same time. But for the moment, let's keep it simple. I don't have a lot of free time. 😢
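
A rough sketch of what collecting that kind of metric in Spark could look like, assuming (purely hypothetically) that a `blockId` column were exposed per entity, which neither the format nor the planned connector provides today:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("osm-block-locality").getOrCreate()
import spark.implicits._

// `osm_entities_with_block_id` is purely hypothetical: the generic entity rows plus a
// `blockId` column identifying which pbf block each entity came from.
val entities = spark.table("osm_entities_with_block_id")

val nodeBlocks = entities.where($"type" === "Node")
  .select($"id".as("nodeId"), $"blockId".as("nodeBlock"))

val wayRefs = entities.where($"type" === "Way")
  .select($"id".as("wayId"), $"blockId".as("wayBlock"), explode($"nodes").as("nodeId"))

// For each way: what fraction of its referenced nodes live in the same block as the way.
// A high ratio would justify partitioning per pbf block to reduce shuffle.
val locality = wayRefs.join(nodeBlocks, "nodeId")
  .groupBy($"wayId")
  .agg(avg(when($"nodeBlock" === $"wayBlock", 1.0).otherwise(0.0)).as("sameBlockRatio"))
```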

@angelcervera (Member, Author) commented:

Let's comment on every task in its own issue so we can stay focused.

@angelcervera (Member, Author) commented Sep 19, 2020

Hi @ericsun95
I have the first version of the connector ready to release, but there is an issue that I want to share to get feedback.

osm4scala and Spark both use the Google Protobuf library, but two different versions that are not compatible.

The solution is to shade the library in a fat jar.

To do this, there are three options:

  1. Don't shade the library; shade it in the user application instead. The user of the osm4scala connector will need to create a fat jar with their application and shade osm4scala or the Google Protobuf library. I don't like the idea of forcing the user to create a fat jar. In fact, I always try to avoid it, because in the end this practice brings a lot of hidden surprises.
  2. Release the Spark connector as a shaded fat library. It will force me to shade all libraries, to avoid conflicts with others used by the application.
  3. Release two versions of osm4scala: the current one with the latest version of Google Protobuf, and another one with version 2.5.0 (from Mar 2013 😭), which is the one used by Spark. This option would be the simplest for the final user, but it would be a pain to maintain.

Tbh, I don't like any of the options. What's your opinion? From the user's point of view, is it fine for you to build a fat jar, or is it better to use a fat library?
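
For reference, a minimal build.sbt sketch of what option 1 would mean on the application side, assuming sbt-assembly; the plugin and rule API are sbt-assembly's, nothing osm4scala-specific:

```scala
// project/plugins.sbt
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")

// build.sbt -- shade Google Protobuf inside the application fat jar so it cannot clash
// with the protobuf 2.5.0 that ships with Spark.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
)
```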

@ericsun95 (Contributor) commented:

> Hi @ericsun95
> I have the first version of the connector ready to release, but there is an issue that I want to share to get feedback.
>
> osm4scala and Spark both use the Google Protobuf library, but two different versions that are not compatible.
>
> The solution is to shade the library in a fat jar.
>
> To do this, there are three options:
>
> 1. Don't shade the library; shade it in the user application instead. The user of the osm4scala connector will need to create a fat jar with their application and shade osm4scala or the Google Protobuf library. I don't like the idea of forcing the user to create a fat jar. In fact, I always try to avoid it, because in the end this practice brings a lot of hidden surprises.
>
> 2. Release the Spark connector as a shaded fat library. It will force me to shade all libraries, to avoid conflicts with others used by the application.
>
> 3. Release two versions of osm4scala: the current one with the latest version of Google Protobuf, and another one with version 2.5.0 (from Mar 2013 😭), which is the one used by Spark. This option would be the simplest for the final user, but it would be a pain to maintain.
>
> Tbh, I don't like any of the options. What's your opinion? From the user's point of view, is it fine for you to build a fat jar, or is it better to use a fat library?

Just curious: are there any other ways to fix the conflict, like forcing the build to pick the latest version? I prefer option 2. Option 1 seems more complicated for someone not so familiar with osm4scala. Anyway, nice job!
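
On the "force one version" idea: the override itself is trivial to write, as in the hedged build.sbt sketch below, but because the two Protobuf versions are not compatible (as noted above), it usually just moves the breakage to the Spark/Hadoop side, which is why shading is discussed instead.

```scala
// build.sbt -- illustrative only; the version number is an example.
dependencyOverrides += "com.google.protobuf" % "protobuf-java" % "3.12.4"
// Classes compiled against protobuf-java 2.5.0 (Spark/Hadoop) may then fail at runtime,
// since the two versions are not compatible, per the discussion above.
```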

@angelcervera (Member, Author) commented:

Hi @ericsun95
I released #41 with the first connector version.

Enjoy

@ericsun95 (Contributor) commented:

> Hi @ericsun95
> I released #41 with the first connector version.
>
> Enjoy

Cool. Nice job!!

@angelcervera (Member, Author) commented:

Keeping #39 and #40 open, but the Spark connector has been working and used in production for a while, so this ticket is done.
