<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Fluree Dev</title>
    <description>The latest articles on Forem by Fluree Dev (@flureepbc).</description>
    <link>https://forem.com/flureepbc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1199503%2F69b04f3c-44ff-49e6-96b0-49ccc6fda7c3.png</url>
      <title>Forem: Fluree Dev</title>
      <link>https://forem.com/flureepbc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/flureepbc"/>
    <language>en</language>
    <item>
      <title>What is SHACL?</title>
      <dc:creator>Fluree Dev</dc:creator>
      <pubDate>Wed, 28 Feb 2024 22:06:00 +0000</pubDate>
      <link>https://forem.com/fluree/what-is-shacl-1li0</link>
      <guid>https://forem.com/fluree/what-is-shacl-1li0</guid>
      <description>&lt;p&gt;&lt;em&gt;SHACL is a critical tool for anyone involved in the management, curation, and utilization of RDF data. Its ability to enforce complex data quality rules in a flexible and scalable manner makes it indispensable for maintaining high-quality, interoperable datasets in the age of big data and the semantic web.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is SHACL?
&lt;/h2&gt;

&lt;p&gt;Shapes Constraint Language (SHACL) is a powerful language for validating RDF (Resource Description Framework) knowledge graphs against a set of conditions. These conditions, known as shapes, can be used by data governance professionals to enforce consistent data quality across an organization’s data ecosystem. SHACL was developed by the World Wide Web Consortium (W3C) and is the open industry standard for data quality assurance in data-centric semantic technologies and linked data projects.&lt;/p&gt;

&lt;p&gt;In Fluree’s graph database, users leverage policy-based SHACL to express constraints like required properties, data types, cardinality, and closed class shapes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the difference between OWL and SHACL?
&lt;/h2&gt;

&lt;p&gt;OWL and SHACL are often compared because both help manage and maintain datasets, but their core competencies differ: OWL helps with inference; SHACL helps with validation. Let’s break it down further:&lt;/p&gt;

&lt;p&gt;OWL is primarily designed for defining and reasoning about ontologies. It provides a rich set of constructs for describing the relationships between concepts in a domain, enabling sophisticated inferencing capabilities about the types of entities and their relationships. OWL is used to create complex domain models and to infer new knowledge from existing data.&lt;/p&gt;

&lt;p&gt;SHACL, on the other hand, is focused on data validation. It allows developers and data architects to define constraints on the structure and content of RDF graphs, ensuring that the data adheres to specified patterns, value ranges, or other criteria. While OWL focuses on enabling inference, SHACL is specifically tailored for validation, offering a more direct approach to enforcing data quality rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does SHACL look like?
&lt;/h2&gt;

&lt;p&gt;SHACL uses a graph-based syntax, where shapes are themselves defined as RDF graphs. Let’s take a look at a simple example: say we want to ensure that all values of the property &lt;code&gt;schema:birthDate&lt;/code&gt; are typed as &lt;code&gt;xsd:dateTime&lt;/code&gt; (in plain English: birthdays must be valid dates and times). In Fluree, we could insert that constraint into a ledger like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "@context": "https://ns.flur.ee",
  "ledger": "ledger/data-type",
  "insert": {
    "@id": "ex:UserShape",
    "@type": ["sh:NodeShape"],
    "sh:targetClass": { "@id": "ex:Person" },
    "sh:property": [
      {
        "sh:path": { "@id": "schema:birthDate" },
        "sh:datatype": { "@id": "xsd:dateTime" }
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example defines a shape for &lt;code&gt;ex:Person&lt;/code&gt; instances, ensuring that the &lt;code&gt;schema:birthDate&lt;/code&gt; property of such entities, if present, must be a valid date-time value.&lt;/p&gt;
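&lt;p&gt;To see the shape in action, consider a hypothetical transaction (the subject IRI below is illustrative) that assigns a plain string where a date-time is required. Because the value cannot be interpreted as an &lt;code&gt;xsd:dateTime&lt;/code&gt;, we would expect Fluree to reject the transaction with a SHACL datatype violation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "@context": "https://ns.flur.ee",
  "ledger": "ledger/data-type",
  "insert": {
    "@id": "ex:alice",
    "@type": ["ex:Person"],
    "schema:birthDate": "not-a-date"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;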

&lt;h2&gt;
  
  
  What are the benefits of using SHACL for data validation?
&lt;/h2&gt;

&lt;p&gt;Data governance is increasingly critical, as organizations need to leverage clean, available, and trusted data on a day-to-day basis. By employing a global framework for data validation, organizations can save an immense amount of time and effort in standardizing data for analytics and re-use.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Er5tlYL1ZMU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;SHACL offers several benefits for data validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexibility and Expressiveness: SHACL can express a wide range of constraints, including property existence, value type, value range, cardinality, and more complex conditions.&lt;/li&gt;
&lt;li&gt;Scalability: SHACL validation can be implemented efficiently, making it suitable for large datasets.&lt;/li&gt;
&lt;li&gt;Standardization: As a W3C standard, SHACL ensures interoperability and consistency across different tools and platforms. This makes interoperable not only the data, but also the rules that govern data consistency and validation.&lt;/li&gt;
&lt;li&gt;Declarative Approach: SHACL expresses validation rules declaratively, separating the rules from their execution, which improves the maintainability and understandability of data constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What kinds of quality constraints can SHACL enforce?
&lt;/h2&gt;

&lt;p&gt;SHACL can enforce a variety of quality constraints, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structural Constraints: Ensuring that data adheres to a specific schema or model, such as required properties, permissible property values, or specific class hierarchies.&lt;/li&gt;
&lt;li&gt;Value Constraints: Limiting the values that can be taken by properties, including data types, value ranges, and pattern matching.&lt;/li&gt;
&lt;li&gt;Cardinality Constraints: Defining the minimum and maximum occurrences of properties.&lt;/li&gt;
&lt;li&gt;Logical Constraints: Applying logical conditions to properties and values, such as equality or inequality, and combinations thereof through logical operators.&lt;/li&gt;
&lt;/ul&gt;
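&lt;p&gt;As a sketch of how several of these constraint types look in the standard SHACL vocabulary (the class and property names below are illustrative), a single node shape can combine cardinality, datatype, range, and pattern constraints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "@id": "ex:PersonShape",
  "@type": ["sh:NodeShape"],
  "sh:targetClass": { "@id": "ex:Person" },
  "sh:property": [
    {
      "sh:path": { "@id": "schema:email" },
      "sh:minCount": 1,
      "sh:maxCount": 1,
      "sh:pattern": "^[^@]+@[^@]+$"
    },
    {
      "sh:path": { "@id": "ex:age" },
      "sh:datatype": { "@id": "xsd:integer" },
      "sh:minInclusive": 0
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;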

&lt;h2&gt;
  
  
  How does SHACL improve enterprise data at scale?
&lt;/h2&gt;

&lt;p&gt;For enterprises dealing with vast amounts of data, maintaining data quality is paramount. SHACL provides a robust framework for ensuring that data across the organization conforms to agreed-upon standards and models. At scale, SHACL helps in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automating Data Validation: Automated tools can leverage SHACL shapes to validate data as it is ingested, updated, or transformed, ensuring continuous data quality without manual intervention.&lt;/li&gt;
&lt;li&gt;Enforcing Data Governance: SHACL shapes can embody data governance policies, ensuring compliance with internal and external data standards and regulations.&lt;/li&gt;
&lt;li&gt;Improving Data Interoperability: By enforcing standardized data models and structures, SHACL facilitates data sharing and interoperability both within the enterprise and with external partners.&lt;/li&gt;
&lt;li&gt;Enhancing Data Quality: Consistent application of SHACL validation helps identify and rectify data quality issues early, reducing errors and improving the reliability of data-driven decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SHACL is a critical tool for anyone involved in the management, curation, and utilization of RDF data. Its ability to enforce complex data quality rules in a flexible and scalable manner makes it indispensable for maintaining high-quality, interoperable datasets in the age of big data and the semantic web.&lt;/p&gt;

&lt;p&gt;Try SHACL out with Fluree! &lt;br&gt;
Head on over to our &lt;a href="https://next.developers.flur.ee/docs/reference/cookbook/#shacl"&gt;cookbook&lt;/a&gt; documentation for some examples that you can test out with your &lt;a href="http://data.flur.ee/signup"&gt;free Fluree cloud account&lt;/a&gt;! &lt;/p&gt;

</description>
      <category>shacl</category>
      <category>rdf</category>
      <category>knowledgegraph</category>
      <category>datavalidation</category>
    </item>
    <item>
      <title>What is JSON-LD?</title>
      <dc:creator>Fluree Dev</dc:creator>
      <pubDate>Tue, 23 Jan 2024 17:33:17 +0000</pubDate>
      <link>https://forem.com/fluree/what-is-json-ld-2dgi</link>
      <guid>https://forem.com/fluree/what-is-json-ld-2dgi</guid>
      <description>&lt;h2&gt;
  
  
  Future-Proof Your Data with JSON-LD
&lt;/h2&gt;

&lt;p&gt;Fluree has officially released native &lt;a href="http://flur.ee/json-ld"&gt;JSON-LD support&lt;/a&gt;. But what exactly is JSON-LD, and why does it matter?  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For developers used to working with document data stores like MongoDB, JSON-LD is a bridge to the world of linked data. Linked data is the reason &lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37430.pdf"&gt;Google has been able to provide such unique and sophisticated services to consumers&lt;/a&gt; - but that power is about to expand far beyond simple search engine optimization. Linked data allows organizations, people, and applications to access and share distributed sets of knowledge under global semantic standards - a powerful concept in the context of the modern data age.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At Fluree, we are big fans of the semantic web standards - especially RDF - because they allow for &lt;a href="https://flur.ee/2021/03/10/semantic-interoperability-exchanging-data-with-meaning/"&gt;data to be organized and exchanged with universal meaning&lt;/a&gt;. Semantic standards are necessary for a future internet in which machines and humans can seamlessly connect to, leverage, and collaborate around disparate sets of linked data. The implications - especially in the era of emerging AI technologies, enterprise knowledge graphs, and industry data-sharing ecosystems - are profound.&lt;/p&gt;

&lt;p&gt;But we are also realists - in the last ten years, most developers have chosen convenience over the grand vision of the semantic web, opting for JSON document data stores. JSON’s incredibly lightweight interchange format has seen mass adoption thanks in part to the influence of JavaScript on software over the last decade. &lt;/p&gt;

&lt;p&gt;But the simplicity of JSON backfires when it comes to any case of reusing, sharing, or linking data. Without adding the context of data types, classifications, and relationships, data can easily become isolated, leading to a fragmented and siloed understanding of its implications.&lt;/p&gt;

&lt;p&gt;That's why we're passionate about the JSON-LD specification — which Fluree has fully implemented and now supports.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is JSON-LD?
&lt;/h2&gt;

&lt;p&gt;JSON-LD - or JavaScript Object Notation for Linked Data - combines the simplicity and ease of JSON documents with the power of linked data. JSON-LD allows you to encode meaning within your JSON document through the use of a shared vocabulary - a total upgrade from your typical JSON payload.&lt;/p&gt;

&lt;p&gt;JSON-LD is JSON for linked data. In fact, a JSON-LD document is always a valid JSON document. Broadly speaking, there are two core capabilities that the “LD” part provides:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It allows you to embed or reference the definition of objects within your JSON payload. For example, the &lt;code&gt;@context&lt;/code&gt; element embedded in a JSON-LD document maps object properties to concepts from an RDF ontology (a universally standardized schema).&lt;/li&gt;
&lt;li&gt;It allows you to assign IRIs to properties in order to establish uniqueness.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🔍 What is an IRI?
&lt;/h2&gt;

&lt;p&gt;An IRI (Internationalized Resource Identifier) is a generalization of the more commonly known URI (Uniform Resource Identifier). While URIs are limited to a subset of the ASCII character set, IRIs provide a way to include characters from a wider range, including Unicode. This expansion allows for the use of scripts and characters from various languages around the world, making it more globally inclusive.&lt;/p&gt;

&lt;p&gt;In the context of linked data and the semantic web, IRIs are used to uniquely identify resources. This ensures that there's no ambiguity when referring to particular entities, concepts, or relationships, regardless of where they originate or the language they're represented in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding @context, @id, and @type
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;@context&lt;/strong&gt; - The &lt;code&gt;@context&lt;/code&gt; tag in a JSON-LD payload gives us just that - context - by mapping terms in the document to a vocabulary. For example, one might map terms to schema.org, a well-known reference site for structured data schemas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;@id&lt;/strong&gt; - The &lt;code&gt;@id&lt;/code&gt; tag gives us a way to establish uniqueness for data - which is fundamentally critical to prevent ambiguity and duplication in linked data systems. The &lt;code&gt;@id&lt;/code&gt; tag is like a digital address: we can ‘park’ unique data at IRIs and point to it later on. Uniqueness allows us to distinguish between things, or to recognize that one thing in a dataset is exactly the same as a thing in another dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;@type&lt;/strong&gt; - Represents the data type or class of the object, providing a way to categorize the data in terms of what it represents in the real world.&lt;/p&gt;

&lt;p&gt;Let’s take a look at a JSON-LD document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "@context": {
    "schema": "https://schema.org/",
    "name": "schema:name",
    "author": "schema:author",
    "datePublished": "schema:datePublished",
    "articleBody": "schema:articleBody"
  },
  "@type": "schema:Article",
  "@id": "https://example.com/articles/json-ld-introduction",
  "name": "Introduction to JSON-LD",
  "author": {
    "@type": "schema:Person",
    "name": "John Doe"
  },
  "datePublished": "2023-10-27",
  "articleBody": "JSON-LD is a powerful way to represent linked data using JSON. It combines the simplicity of JSON with the capabilities of linked data, allowing for richer metadata, better SEO, and improved data interoperability..."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;&lt;code&gt;@context&lt;/code&gt;: Defines the context of the JSON-LD document. It maps terms to IRIs so that properties such as "name" or "author" are clearly associated with concepts defined at schema.org, a popular vocabulary for structured data on the internet.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@type&lt;/code&gt;: Specifies that this JSON-LD document represents an article. The "schema:Article" IRI identifies the article type as per the schema.org vocabulary.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@id&lt;/code&gt;: Provides a unique identifier for this article. In this case, it's a URL to the article on "example.com".&lt;/li&gt;
&lt;li&gt;author: Contains information about the author of the article. Nested within it, we see another &lt;code&gt;@type&lt;/code&gt; that identifies this entity as a "Person" as per the schema.org vocabulary. The author has a "name" property.&lt;/li&gt;
&lt;/ul&gt;
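&lt;p&gt;Because the &lt;code&gt;@context&lt;/code&gt; maps every term to a full IRI, a JSON-LD processor can mechanically expand the compact document above into an unambiguous form. A sketch of the expanded result (abbreviated) looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "@id": "https://example.com/articles/json-ld-introduction",
  "@type": ["https://schema.org/Article"],
  "https://schema.org/name": [{ "@value": "Introduction to JSON-LD" }],
  "https://schema.org/author": [{
    "@type": ["https://schema.org/Person"],
    "https://schema.org/name": [{ "@value": "John Doe" }]
  }],
  "https://schema.org/datePublished": [{ "@value": "2023-10-27" }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two documents that expand to the same IRIs mean the same thing, no matter which compact aliases each author chose - which is exactly the property that makes linked data interoperable.&lt;/p&gt;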
&lt;h2&gt;
  
  
  Why use JSON-LD?
&lt;/h2&gt;

&lt;p&gt;JSON-LD gives developers a familiar format in which to deliver the value of linked data and provides organizations with an easy path to integrate the benefits of linked data into a broader range of their data management operations.&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://www.youtube.com/watch?si=HHGZaenyqCayFjBU&amp;amp;%3Bt=538&amp;amp;v=vlb61pnI9KM&amp;amp;feature=youtu.be" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--JFrD676g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.ytimg.com/vi/vlb61pnI9KM/maxresdefault.jpg" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://www.youtube.com/watch?si=HHGZaenyqCayFjBU&amp;amp;%3Bt=538&amp;amp;v=vlb61pnI9KM&amp;amp;feature=youtu.be" rel="noopener noreferrer" class="c-link"&gt;
          Webinar | Getting Started with Fluree V3 - YouTube
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Are you ready for Fluree Version 3? With an intuitive cloud-based managed service option and full JSON-LD support, we are excited to walk through our product...
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--DhFoJ1Lp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.youtube.com/s/desktop/d84e1538/img/favicon.ico" width="16" height="16"&gt;
        youtube.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;What’s the deal with Linked Data? Linked data is the practice of connecting data across domains through the use of semantic standards. Linked data uses standardized schemas known as ontologies to transform information into organized, connected, and contextual knowledge. These properties allow data to become interlinked and machine-readable on the web and promote knowledge discovery and federated analytics. Linked data can be applied within the enterprise in the form of semantic knowledge graphs, on the public internet in the form of Open Linked Data, or in a hybrid model in the form of semantic data-sharing ecosystems.&lt;/p&gt;

&lt;p&gt;JSON-LD allows organizations to gain all the benefits of linked data - without the hassle and overhead of transforming data or forcing developers to use a foreign data format. JSON-LD immediately increases the value of JSON documents by making the data stored within them interoperable with other linked data systems - such as internal knowledge graphs or on the open linked web - as well as by providing an accessible medium for developers and applications to interact with linked data. &lt;/p&gt;

&lt;p&gt;Here’s a few examples of how JSON-LD makes an impact: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Building Data-Centric Applications&lt;/strong&gt; - Software increasingly produces data that has more than one purpose - whether analytics, sharing, compliance reporting, or re-operationalization into a new application - data is versatile. JSON-LD future-proofs data for reuse within a data-centric architecture, allowing data to exist independently of a singular application and to power multiple data applications without duplication. &lt;a href="https://flur.ee/2020/10/05/introduction-to-data-centricity/"&gt;Read more about data-centricity here&lt;/a&gt;. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Participation in the Semantic Web&lt;/strong&gt; - JSON-LD gives document data a path towards participation in the semantic web. This core initial use case has seen wide adoption in the SEO (Search Engine Optimization) field because JSON-LD helps add metadata and structure to webpages. Beyond SEO, JSON data will need to interoperate at webscale for a variety of use cases - and JSON-LD is the perfect tool for the job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Broadening the Impact and Capabilities of Enterprise Knowledge Graphs&lt;/strong&gt;: Linking data allows the formation of ontology-driven knowledge graphs, a data management trend that supports advanced analytics and insights through the use of semantic data standards. JSON-LD allows knowledge graphs to incorporate semantically tagged documents as part of their data domains - but it also provides a medium in which to operationalize applications atop knowledge graphs.&lt;br&gt;
Knowledge graphs should be dynamic and operational in that they power multiple data applications and are contributed to by multiple data source domains. This is supported by our &lt;a href="https://flur.ee/2020/10/05/introduction-to-data-centricity/#:~:text=The%20Data%2DCentric%20Architecture%20treats,the%20entire%20data%20value%20chain."&gt;data-centric philosophy&lt;/a&gt; in which data should be able to connect to, influence and power multiple data applications. Because JSON-LD extends JSON with linked data capabilities, adding and deleting objects in knowledge graphs is as simple as performing transactions against a document database like MongoDB. This read/write capability then allows a knowledge graph to serve as an operational backend to real-time web-based applications that use a JSON interface.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feeding Analytics Better Data&lt;/strong&gt; - JSON-LD extends the value and versatility of data with context, thereby preparing it for downstream activity. For instance, JSON-LD prepares data (that would otherwise be too simple or obscure) for an instantaneous analytical experience within data science teams. Because JSON-LD data arrives with self-describing context and a standardized semantic schema, downstream systems can immediately understand, evaluate, and take action on insights. For these systems, JSON-LD makes data immediately &lt;a href="https://flur.ee/2020/07/08/making-data-fair/"&gt;F.A.I.R. (Findable, Accessible, Interoperable and Reusable)&lt;/a&gt;. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Machine-readability&lt;/strong&gt; - JSON-LD upgrades JSON documents to become machine-readable, meaning computers can automatically exchange, access and understand information without the need for a human intermediary. This capability will be crucial for data to participate in the machine-to-machine web and power the next wave of automation and AI. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data sharing ecosystems&lt;/strong&gt; - Data sharing ecosystems are on the rise, introducing complexities that data systems have not had to account for in the past. For example, data from a single repository might be contributed to by multiple organizations. Or, records in a consortium might need to flow across systems and applications - in a hospital system, for example, patient information is securely collaborated on by different clinics and insurance providers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linked data supports these complexities&lt;/strong&gt; - with semantic standards, we can exchange information across systems, execute federated queries across multiple disparate data sets, and collaborate around data sets with standardized schemas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verifiable Credentials &amp;amp; Decentralized Semantics&lt;/strong&gt; - Allows for authentication and assurance of the data's source, ensuring its validity and credibility. By incorporating decentralized semantics, data gets rooted in a universally accepted framework, making it trustworthy and interoperable across a network. (check out our cool work with the &lt;a href="https://www.youtube.com/watch?v=XZUZSlsAepg"&gt;Trusted Learner Network!&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;At Fluree, our vision of Web 3.0 involves an internet of trusted and linked data. The problem is that our tools haven’t addressed the need for data to be linked - developers have found document databases quite preferable for simplicity’s sake. It’s time to prepare our data - at the source of its creation - with capabilities for interoperability and exchange across disparate data systems. JSON-LD is that needed bridge. Try JSON-LD out for yourself &lt;a href="http://dev.flur.ee"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are emerging innovations for trusted and interoperable data exchange that JSON-LD will help fuel - including personal knowledge graphs and &lt;a href="https://www.w3.org/TR/vc-data-model/"&gt;verifiable credentials&lt;/a&gt;. We’ll cover examples of those specifications in future blog posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://json-ld.org/"&gt;(https://json-ld.org/)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://flur.ee/json-ld/"&gt;(https://flur.ee/json-ld/)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.slideshare.net/gkellogg1/json-for-linked-data"&gt;(https://www.slideshare.net/gkellogg1/json-for-linked-data )&lt;/a&gt;&lt;/p&gt;

</description>
      <category>json</category>
      <category>linkeddata</category>
      <category>database</category>
      <category>web3</category>
    </item>
    <item>
      <title>Seamless Data Sharing, Baked Right In</title>
      <dc:creator>Fluree Dev</dc:creator>
      <pubDate>Thu, 21 Dec 2023 20:25:39 +0000</pubDate>
      <link>https://forem.com/fluree/seamless-data-sharing-baked-right-in-5bji</link>
      <guid>https://forem.com/fluree/seamless-data-sharing-baked-right-in-5bji</guid>
      <description>&lt;p&gt;&lt;em&gt;Explore fundamental principles, uncover challenges, and discover how Fluree's open web standards effortlessly facilitate sharing data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As things start to wind down at the end of the year, those of us in tech start looking for easy wins, like updating our email signatures to match the new company branding guidelines or scheduling a “first thing next year” check-in for that data integration effort that has as much chance of reporting progress as Santa’s milk and cookies have of seeing Christmas Day.&lt;/p&gt;

&lt;p&gt;We who manage data and share it with others knowingly smile, because we’ve all seen a “data integration” project - or even a “simple data request” - that seems straightforward but is actually fraught with unforeseen challenges and setbacks. Why are these efforts so widely known to cause grief? Because in a career so often filled with miraculous technological advances and software tools that exponentially narrow development time and widen the impact of our efforts, we feel ridiculous estimating (and inevitably extending) a timeline of half a year or more for any project that involves sharing existing data across applications and firewalls.&lt;/p&gt;

&lt;p&gt;Let’s get into the fundamentals of sharing data, expose the challenges lurking in the implementation details, and learn why projects that use Fluree need not be concerned about any of this thanks to a few open web standards – baked right into Fluree – that enable seamless data sharing.&lt;/p&gt;

&lt;p&gt;The chestnut of this article came from our &lt;a href="https://developers.flur.ee/docs/examples/datasets/academic-credentials/querying/"&gt;walkthrough documentation on the fictional Academic Credential Dataset&lt;/a&gt;. If you’d like to dive deeper into solving the issues tackled in this post, head on over to our &lt;a href="https://developers.flur.ee/"&gt;docs site&lt;/a&gt; to see what it’s like solving hard problems with Fluree.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Sharing Problem
&lt;/h2&gt;

&lt;p&gt;In our fantastic doc &lt;a href="https://developers.flur.ee/docs/learn/tutorial/collaborative-data/"&gt;Collaborative Data&lt;/a&gt; we highlight the core problem with sharing data: &lt;a href="https://martinfowler.com/bliki/TwoHardThings.html"&gt;naming&lt;/a&gt;! You and I describe the same things in different ways – my Customers table and your CustomerDetails table might both describe Freddy the Yeti, and may even contain the exact same data for Freddy, but in typical database systems there is no way for us to know we’re describing the same yeti, nor can we even compare the data we have on Freddy from our separate systems without a mapping definition and process.&lt;/p&gt;
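&lt;p&gt;With linked data, both of our systems could describe Freddy with the same globally unique &lt;code&gt;@id&lt;/code&gt; and a shared vocabulary, making our records directly comparable with no mapping step. A hypothetical sketch (the IRI and properties below are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "@context": { "schema": "https://schema.org/" },
  "@id": "https://example.com/ids/freddy-the-yeti",
  "@type": "schema:Person",
  "schema:name": "Freddy the Yeti",
  "schema:email": "freddy@example.com"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;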

&lt;p&gt;In fact, when there are any discrepancies between the source and destination schemas, the task of sharing data shifts from a self-service and, dare I say, automated task to a more complex endeavor that requires business analysts, developers, database administrators, and, of course, project managers to corral the whole circus of defining, communicating, building, and validating the necessary data pipeline. I may be exaggerating a bit, especially for smaller requests where a simple export of a small set of records is concerned, but even for a small task, some understanding of the target data and context is required by the provider and consumer to make sense of the request and resulting data. Note that in the case of building a data pipeline, much of the burden falls on the shoulders of tech knowledge workers: we’re expected to learn and reason about multiple contexts, construct systems integrations that must communicate over time and space and handle edge cases and dirty data, and eventually shoulder the weight of maintenance, changing requirements, and feature creep. This is where the cost and grief comes from.&lt;/p&gt;

&lt;p&gt;If maintaining a consistent mapping is crucial for the data owner, it can be achieved by layering it in as the source data is added. However, this approach often results in data duplication, as the information must be stored in multiple formats. Alternatively, the mapping can be automated and done on-the-fly, as the data is requested, but this takes development resources and, depending on the amount of data and frequency of requests, can get expensive (and deeply annoying). Neither of these methods takes into account scenarios where mappings evolve over time or when there are numerous requestors, each with their unique data format requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WHOO_6_X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lgzmznzpytr1x1sbphl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WHOO_6_X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lgzmznzpytr1x1sbphl.png" alt="Image description" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, so differing data schemas mean trouble for data sharing, making it complex, expensive, and generally slow. So why do we have these differences in the first place? If it's such an effort to map and transform data, why can't the receiver just use the same schema as the sender? Or vice versa?&lt;/p&gt;

&lt;p&gt;There are many reasons that vary with size and scale, but most of them boil down to communication, coordination, and cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Different Goals, Different Contexts
&lt;/h2&gt;

&lt;p&gt;When building data exports, APIs, and other data sharing infrastructure, data owners lean on their own internal understandings of their data. There are intrinsic properties of the data (e.g. relationships, data types, field names) that only exist as a byproduct of the context in which the data owner collected and generated the data, and yet these properties dictate the format, shape, and content of the data being shared. On the other hand, each and every data consumer (those that receive data from data owners) have another distinct understanding of the data they're interested in. They operate within a different context. They have applications and analytics that are built on data of a certain shape with certain properties that have certain names and types.&lt;/p&gt;

&lt;p&gt;Efficiently conveying these distinct contexts and ensuring that everyone consistently employs the same data context for their specific use cases can appear to be an insurmountable challenge. On the other hand, if we permit each consumer to maintain their own context, any modification to the data sharing infrastructure necessitates an equivalent degree of communication and coordination, resulting in each individual bearing the cost of staying up-to-date.&lt;/p&gt;

&lt;p&gt;The challenges with data formatting and mapping make sharing data and hosting data sources difficult to accomplish and, when successful, constrained to niche, data-intensive research fields that require a consistent context. To cope, these problem spaces must rely on centralized data brokers that dictate the sharing format and other rules of engagement. This setup means relinquished data ownership and control, reduced benefit from data partnerships, and limited reach of knowledge and information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tl;dr is: the current state of data infrastructure can only produce data silos which constrain the impact of our data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In an ideal world, we would all use the same schema, the same data context. We would use the same table names, use the same column names and data types, and, while we’re dreaming, we’d use the same identifiers for our records! That way, there’s no need for a translation layer, there’s just retrieving data. Seems silly right? There’s no way we can all use the same database schema. What would that even mean?&lt;/p&gt;

&lt;p&gt;Fluree builds on web standards to provide essential features that, when combined, solve these traditionally challenging problems. One of these standards is the JSON-LD format, which gives our data the ability to describe itself, enabling portability beyond the infrastructure where it originated. We call this “schema on write” and “schema on read,” which just means developers can build their databases on top of universal standards, and that data can be shared immediately, pre-packaged with mappings and context for consumer use cases. Let’s take a closer look at how Fluree’s approach to data management obviates these problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Sharing Solution
&lt;/h2&gt;

&lt;p&gt;What does it mean for our data to be able to "describe itself," and how does this concept solve these longstanding data-sharing problems? I mentioned the term "context" a bit in my statement of the problem. In a nutshell, the data context contains all of the semantic meaning of the data. This includes field names, relationships among objects and concepts, and type information: is this value a number, a date, or a more complex class like an invoice or a patient visit? This contextual data is traditionally defined all over the place: in table and column definitions, in application and API layers, in data dictionaries distributed with static datasets, in notes columns in a spreadsheet, in the head of the overworked and under-resourced data architect. As discussed, this contextual data can be difficult to maintain, represent, and transfer to data consumers.&lt;/p&gt;

&lt;p&gt;But wait! What if all this contextual data were stored and retrieved as data in the dataset? What would it look like if we took all of the contextual data that can be found in the best, most complete data dictionary, API documentation, or SQL schema explorer and injected it right in there with the data content itself? JSON-LD and a few other open web standards, like RDF and RDFS, do exactly this, and Fluree relies on them to enable simple and seamless data sharing.&lt;/p&gt;
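&lt;p&gt;To make that concrete, here is a minimal sketch of a JSON-LD document (the property names, IRIs, and the use of the public &lt;code&gt;schema.org&lt;/code&gt; vocabulary are illustrative choices, not Fluree-specific): the &lt;code&gt;@context&lt;/code&gt; travels with the data and maps each local field name to a globally defined identifier.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "@context": {
    "name": "http://schema.org/name",
    "worksFor": "http://schema.org/worksFor"
  },
  "@id": "http://example.org/people/jane",
  "@type": "http://schema.org/Person",
  "name": "Jane Doe",
  "worksFor": "http://example.org/orgs/acme"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A consumer doesn't need a separate data dictionary to learn what "name" means here; the context resolves it to an identifier anyone can look up.&lt;/p&gt;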

&lt;p&gt;RDF and JSON-LD are simply data formats that can represent graph data. We go into more detail in our &lt;a href="https://developers.flur.ee/docs/learn/tutorial/fluree-data-model/"&gt;Data Model&lt;/a&gt; doc, and there are some excellent resources online, as RDF has been around for a while. RDFS is an extension of RDF that adds some very useful data modeling concepts, like classes and subclasses, which enable us to describe the hierarchies in our data. JSON-LD and its ability to convey contextual and vocabulary data alongside the data itself is covered extensively in our excellent &lt;a href="https://developers.flur.ee/docs/learn/tutorial/collaborative-data/"&gt;Collaborative Data&lt;/a&gt; doc.&lt;/p&gt;
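&lt;p&gt;For a flavor of what RDFS adds, here is a hypothetical class hierarchy expressed as JSON-LD (the &lt;code&gt;Invoice&lt;/code&gt; and &lt;code&gt;Document&lt;/code&gt; classes and their IRIs are made up for illustration): &lt;code&gt;rdfs:subClassOf&lt;/code&gt; lets the data itself declare that every invoice is also a document.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "@context": {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#"
  },
  "@id": "http://example.org/vocab/Invoice",
  "@type": "rdfs:Class",
  "rdfs:subClassOf": { "@id": "http://example.org/vocab/Document" }
}
&lt;/code&gt;&lt;/pre&gt;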

&lt;p&gt;&lt;strong&gt;The gist is that by using universally defined identifiers for both subjects and properties, all participants (both data sources and consumers) can build on top of a fully-defined and open understanding of the data schema. No more data silos.&lt;/strong&gt;&lt;/p&gt;
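&lt;p&gt;In practice, this is what JSON-LD expansion produces: strip away the local shorthand and every property stands as its full IRI. Two organizations that each expand their own documents (the IRIs below are illustrative) end up speaking the same vocabulary without ever coordinating field names.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "@id": "http://example.org/people/jane",
  "http://schema.org/name": [ { "@value": "Jane Doe" } ],
  "http://schema.org/worksFor": [ { "@id": "http://example.org/orgs/acme" } ]
}
&lt;/code&gt;&lt;/pre&gt;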

&lt;p&gt;Oh hey! Thanks for reading! I’ll leave you with another benefit of using Fluree: portability! Portability, the opposite of “vendor lock-in,” is another one of Fluree’s incredible side effects. Because Fluree is built on open standards, like the ones discussed in this article, all of the value provided is baked right into the data itself! This means that Fluree relies on externally defined mechanisms (including the storage format, RDF) that have meaning outside of any Fluree database or platform. So when you share your data, or if you decide to use a different database in the future, all of that self-describability goes along for the ride!&lt;/p&gt;

&lt;p&gt;Blog credits to Fluree Developer Jack White&lt;/p&gt;

</description>
      <category>database</category>
      <category>web3</category>
      <category>architecture</category>
      <category>development</category>
    </item>
  </channel>
</rss>
