Merge branch 'main' of https://github.com/brightway-lca/bw_interface_schemas

cmutel · cmutel · commit 2b4f51bb53f3 · 2024-12-10T13:31:31.000+01:00
diff --git a/.github/workflows/python-test.yml b/.github/workflows/python-test.yml
@@ -59,4 +59,4 @@ jobs:
           pytest
 
       - name: Upload coverage reports to Codecov
-        uses: codecov/codecov-action@v4
+        uses: codecov/codecov-action@v5
diff --git a/README.md b/README.md
@@ -9,15 +9,45 @@
 
 `bw_interface_schemas` defines a set of [pydantic](https://docs.pydantic.dev/2.0/) classes which will be the fundamental data schema for Brightway Cloud, the next iteration of the Brightway LCA software ecosystem. These schemas provide clear and consistent graph-based interfaces between Brightway software libraries, and simplify and harmonize the way data was modeled and stored in Brightway.
 
-We have chosen to model all data in a graph, as a list of nodes and edges. This includes inventory data, which models how processes consume and produce products to form supply chains. It also includes impact assessment, where elementary flows are linked to impact categories via characterization edges, and data organization. Now both projects and databases are also in the graph, and process and product nodes are linked to databases via `belongs_to` relationship edges.
+We have chosen to model all data in a directed graph, i.e. as nodes and (directed) edges. This includes inventory data, which models how processes consume and produce products to form supply chains, but also includes impact assessment, where elementary flows are linked to impact categories via characterization edges, and collections, where processes and products belong to databases.
 
 ## Example
 
 Here is our standard bicycle production example in the new paradigm:
 
 <img src="example.png">
 
-You can see two ways of building this graph in code in `tests/conftest.py`.
+You can see this graph in code in `tests/conftest.py`.
+
+## Motivation
+
+In previous Brightway versions, the libraries were tightly coupled, and the schemas for passing data between Python libraries or with other software were never explicitly defined or custom-developed. This led to a chaos of utility conversion functions without any guarantees on broad format or on the availability of specific attributes.
+
+The approach in `bw_interface_schemas` is more modular: Brightway IO libraries can work with multiple data stores or data generation and manipulation packages. The definition of nodes and edges is clear, and using `pydantic` gives us reasonable error messages and validation performance.
+
+We have also fixed some poor design choices in older Brightway versions. For example, previously edges were defined on nodes: `node['exchanges'] = list`. In that older schema, the `node` was *always* the edge `target`, even if that didn't make any sense. So emissions were inputs, goods being produced were inputs, etc. Edges also had to be quantitative, so dummy values were added to make proxies for qualitative edges.
+
+Another poor design choice was storing some aspects of the graph outside of the graph. Things like projects, databases, impact categories, and other LCIA objects were stored in a different format (JSON or pickle) in a separate place (filesystem instead of relational database). We now unify these concepts and their data schemas in a single graph.
+
+## Design decisions
+
+* All data is in a graph. That means that the only way we have to express data is nodes linked with edges. A single graph can provide all the information in a project.
+* All attribute data is stored as JSON-serializable values.
+* Nodes have identifiers. We store nodes as a dictionary, where the identifiers are the keys. Edge `source` and `target` attributes refer to these identifiers. Identifiers can be strings or integers, and their label in the node datasets themselves is flexible.
+* Nodes and edges have types, and type labels are given in a set of `Enum` classes. These types correspond with pydantic classes which include custom data attributes and validation functions.
+* Edges have direction, and their direction is meaningful. For example, a process producing a product would have an edge from (`source`) the process to (`target`) a product. If the product was consumed as an input of the process, the product would be the `source` and the process would be the `target`. The same logic applies to processes and elementary flows.
+* The technosphere part of the graph has a strict product -> process -> product pattern. Edges between processes and products must state whether or not they are functional. A functional edge is one where the modeller has indicated that the product being consumed or produced is one of the functions of the process.
+* Processes are located in time and space. Products can be generic (their attributes apply regardless of where or when the product is produced or consumed, such as products meeting some standard), or can have spatio-temporal specificity (the sulfur content or energetic density of natural gas varies across time and space).
+* Elementary flows and products can refer to the same underlying concepts, but are distinct nodes. For example, carbon dioxide has industrial uses and is also an important air resource and emission, but because it operates in different contexts in all three cases, it is modeled as different objects. Biosphere edges always link process nodes to elementary flow nodes, and elementary flow nodes cannot operate in the technosphere.
+* There is no rigid normalization pattern. Edges allow for some degree of normalization (edge sources and targets act like foreign keys to nodes), but other attributes like units are not normalized. Our intent is to specify some of these non-normalized attributes in the [Sentier.dev](https://vocab.sentier.dev/en-US/) vocabulary, and to develop practical approaches to other tricky attributes like location.
+
+## Tags, properties, or dataset attributes?
+
+* Tags are for choosing from an already known set of possibilities where more than one node could share the same value.
+* Properties are for numeric values which describe the object's attributes or performance.
+* Dataset attributes (i.e. `node['foo']`) are for everything else.
+
+As always, [hard cases make bad law](https://en.wikipedia.org/wiki/Hard_cases_make_bad_law), and some things could fit into multiple possible buckets. We will expand and clarify this distinction with more experience.
 
 ## Comparison with Brightway2
 
diff --git a/tests/integration.py b/tests/integration.py
@@ -1,9 +1,9 @@
 from bw_interface_schemas import GraphLoader
 
 
-def dump_graph(bike_as_graph):
+def test_dump_graph(bike_as_graph):
     assert bike_as_graph.model_dump()
 
 
-def construct_graph(bike_as_dict):
+def test_construct_graph(bike_as_dict):
     assert GraphLoader(identifier_field="name").load(bike_as_dict, use_identifiers=True)
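
The "Design decisions" list added to the README in this commit describes nodes keyed by identifier and directed edges whose `source` and `target` refer to those identifiers. The following is a minimal sketch of that convention using plain dictionaries only; it is not the `bw_interface_schemas` API. The key names `node_type`, `edge_type`, `functional`, and `amount`, the type labels, the numeric values, and the three helper functions are all hypothetical and chosen for illustration.

```python
import json

# Hypothetical plain-dict sketch of the graph convention described in the
# README; none of these key names or type labels come from the library.
nodes = {
    "bike": {"name": "bike", "node_type": "product"},
    "steel": {"name": "steel", "node_type": "product"},
    "bike production": {"name": "bike production", "node_type": "process"},
}

edges = [
    # Production: the process is the source, the product is the target.
    {
        "source": "bike production",
        "target": "bike",
        "edge_type": "technosphere",
        "functional": True,
        "amount": 1.0,  # illustrative value
    },
    # Consumption: the product is the source, the process is the target.
    {
        "source": "steel",
        "target": "bike production",
        "edge_type": "technosphere",
        "functional": False,
        "amount": 2.5,  # illustrative value
    },
]


def produced_by(process_id, edges):
    """Return identifiers of nodes that the process points to (its outputs)."""
    return [e["target"] for e in edges if e["source"] == process_id]


def consumed_by(process_id, edges):
    """Return identifiers of nodes that point to the process (its inputs)."""
    return [e["source"] for e in edges if e["target"] == process_id]


def dangling_edges(nodes, edges):
    """Return edges whose source or target is not a known node identifier."""
    return [
        e for e in edges
        if e["source"] not in nodes or e["target"] not in nodes
    ]


# All attribute data is JSON-serializable, so the whole graph round-trips.
round_tripped = json.loads(json.dumps({"nodes": nodes, "edges": edges}))
```

Because direction carries the meaning, the same edge list answers both "what does this process produce?" and "what does it consume?" without a separate input/output flag, and the `dangling_edges` check mirrors the foreign-key role that `source` and `target` play.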