<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: David Kershaw</title>
    <description>The latest articles on Forem by David Kershaw (@david_kershaw_b6916404da6).</description>
    <link>https://forem.com/david_kershaw_b6916404da6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3452728%2Fbd83dcba-604e-4b61-9794-32cd2e66b911.jpg</url>
      <title>Forem: David Kershaw</title>
      <link>https://forem.com/david_kershaw_b6916404da6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/david_kershaw_b6916404da6"/>
    <language>en</language>
    <item>
      <title>JSONPath Is In! The AI Assistant Will See You Now</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Thu, 29 Jan 2026 01:22:12 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/jsonpath-is-in-the-ai-assistant-will-see-you-now-5cge</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/jsonpath-is-in-the-ai-assistant-will-see-you-now-5cge</guid>
      <description>&lt;p&gt;JSONL is a neat and &lt;a href="https://dev.to/david_kershaw_b6916404da6/jsonl-is-a-seriously-weird-format-5flp"&gt;kind of weird data format&lt;/a&gt;. It is well-known to be useful for logs and API calls, among other things. And your favorite AI assistant API is one place you'll probably find it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; supports validating JSONL. (In fact, it supports JSONL for the whole data preboarding lifecycle, but that's a longer story). &lt;/p&gt;

&lt;p&gt;And now &lt;a href="https://github.com/csvpath/csvpath" rel="noopener noreferrer"&gt;CsvPath Validation Language&lt;/a&gt; supports JSONPath expressions. Since AI prompts are only kinda sorta JSONL, having JSONPath to dig into them is helpful.&lt;/p&gt;

&lt;p&gt;What I mean by kinda-sorta is that your basic prompt sequence is a series of JSON lines, but the lines are all 1-column wide and arbitrarily deep. That sounds more like a series of JSON "documents" than it does like single JSONL stream. Or, anyway, that's my take.&lt;/p&gt;

&lt;p&gt;Let's look at how to use JSONPath to inspect a JSONL file using CsvPath Validation Language in &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt;. For those of you who don't already know, FlightPath Data is the dev and ops frontend to CsvPath Framework. It is a free and open source download from the &lt;a href="https://apps.apple.com/us/app/flightpath-data/id6745823097" rel="noopener noreferrer"&gt;Apple MacOS Store&lt;/a&gt; or the &lt;a href="https://apps.microsoft.com/detail/9P9PBPKZ4JDF" rel="noopener noreferrer"&gt;Microsoft Store&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The file is a common example prompt. We'll start by looking at one line.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneyr9u9scxkyth89izit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneyr9u9scxkyth89izit.png" alt=" " width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the last line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "messages": [
    {
      "role": "system",
      "content": "You are a happy assistant that puts a positive spin on everything."
    },
    {
      "role": "user",
      "content": "I'm hungry."
    },
    {
      "role": "assistant",
      "content": "Eat a banana!"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From CsvPath Framework's perspective this is a one-header document. The one header is &lt;code&gt;messages&lt;/code&gt;. If you open this in the grid view you see only the one header. &lt;em&gt;(i.e. one column; but with delimited files we try to stick to the word "header" because with "column" your RDBMS-soaked brain starts to make incorrect assumptions)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here's what it looks like: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwuhhs552iqsbo8oa6i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwuhhs552iqsbo8oa6i4.png" alt=" " width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's not super fun. The reason is that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;JSONL doesn't present its headers in the grid view (for good reasons)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;messages&lt;/code&gt; header is arbitrarily deeply nested, unlike the typical JSONL log line&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nevertheless, that's what we have. Will it blend? I mean validate? Yes. JSONPath to the rescue. That said I'll pause to admit that I'm not a JSONPath expert.&lt;/p&gt;

&lt;p&gt;Right-click in the project files tree on the left of FlightPath and create a new &lt;code&gt;.csvpath&lt;/code&gt; file, e.g. &lt;code&gt;messages.csvpath&lt;/code&gt;. Drop this simple example in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][ 
    push("roles", jsonpath(#messages, "$[?(@.role == 'assistant')].content") )

    last() -&amp;gt; print("See the variables tab for results")
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the &lt;code&gt;jsonpath()&lt;/code&gt; function. It is acting on the &lt;code&gt;messages&lt;/code&gt; header, as we'd want. We're pushing the data pulled by the JSONPath expression into a stack variable named &lt;code&gt;roles&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;A stack variable is like a Python list or tuple. You create a variable by using it. &lt;code&gt;roles&lt;/code&gt; is part of the set of zero or more variables that are available throughout the csvpath statement run. They are captured to the Variables tab for a one-off FlightPath Data test run. For a production run they end up in the &lt;code&gt;vars.json&lt;/code&gt; file in the run results.&lt;/p&gt;

&lt;p&gt;Put your cursor in the csvpath statement and click &lt;code&gt;cmd-r&lt;/code&gt; (or &lt;code&gt;ctrl-r&lt;/code&gt; on Windows). The output tabs should open at the bottom-middle of the screen, below your csvpath file. Click to the &lt;code&gt;Variables&lt;/code&gt; tab and have a look.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgjj2zonsdrbyltabkp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgjj2zonsdrbyltabkp3.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our JSONPath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[?(@.role == 'assistant')].content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;picked out the objects in the &lt;code&gt;messages&lt;/code&gt; array where &lt;code&gt;role&lt;/code&gt; equaled &lt;code&gt;assistant&lt;/code&gt;. And from those objects it extracted and returned the value of the &lt;code&gt;content&lt;/code&gt; key. &lt;/p&gt;

&lt;p&gt;Pretty simple stuff. Tho, I have to admit it took me a few minutes to wrap  my JSONPath-inexperienced head around the context for the JSONPath expression. I was thinking of the whole document or the whole line, but that wasn't right.&lt;/p&gt;

&lt;p&gt;It is obviously the JSON value assigned to the &lt;code&gt;messages&lt;/code&gt; key, which is an array, in this case. Once I was operating from that correct context, the JSONPath became pretty straightforward. (Those of us with XPath scars need not be as afraid as we might be!)&lt;/p&gt;

&lt;p&gt;The point here is two-part. First, we can deal with AI prompts or any other JSONL that is deeply nested. Hooray! The data may look odd, if you are comparing it to regular tabular data, but that's no reason to not validate. &lt;/p&gt;

&lt;p&gt;Second, this example makes the point that we're doing JSONPath rules-based validation within our CsvPath context. How very Schematron-like, since Schematron does XPath validation within XSLT. &lt;/p&gt;

&lt;p&gt;Maybe this sounds complicated, but really it's not. CsvPath Validation Language is great for all things line-oriented. In this case, there isn't much for it to do, except hand off to JSONPath, which is great at documents (a.k.a. objects). Simple enough.&lt;/p&gt;

&lt;p&gt;If we wanted to create a bunch of JSONPath rules to validate our AI prompt JSONL, we could do that. To just do a quick throw-away second rule as an example try this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][ 
    push("roles", jsonpath(#messages, "$[?(@.role == 'assistant')].content") )

    @stmts = jsonpath(#0, "$.`len`")
    print("Line $.csvpath.line_number: $.variables.stmts")
    @stmts == 3 

    last.nocontrib() -&amp;gt; print("See the variables tab for results")
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That new rule will net you 2 lines, which are either valid or failed, depending on how you want to use your csvpath statement. You will see them in the &lt;code&gt;Matches&lt;/code&gt; tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmi8fjj1hpn7desej6v4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmi8fjj1hpn7desej6v4x.png" alt=" " width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the same time the expanded csvpath statement will continue to pull in the same data to the variables tab that we got with the first version of the csvpath.&lt;/p&gt;

&lt;p&gt;To clean it up just a little, you can do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][ 
    push("roles", jsonpath(#messages, "$[?(@.role == 'assistant')].content") )

    jsonpath(#0, "$.`len`") == 3
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There you go, a valid 2-rule validation statement using JSONPath on nested JSON in a JSONL document. Useful? Totally! Give it a try.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>promptengineering</category>
      <category>json</category>
      <category>python</category>
    </item>
    <item>
      <title>Custom Functions FTW</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Wed, 28 Jan 2026 22:20:50 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/custom-functions-ftw-21co</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/custom-functions-ftw-21co</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/csvpath/csvpath" rel="noopener noreferrer"&gt;CsvPath Validation Language&lt;/a&gt; is functions-based. It applies a very simple syntax and a large number of functions to &lt;strong&gt;validate CSV, JSONL, and Excel files&lt;/strong&gt; in ways that were never-before possible.&lt;/p&gt;

&lt;p&gt;And then comes the moment when you want to do some crazy thing that the &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; contributors didn't think of. What to do? You create a custom function that does exactly that.&lt;/p&gt;

&lt;p&gt;Custom functions even work in &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt; and &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Server&lt;/a&gt;. I call that out because FlightPath Data is a multi-project environment. And FlightPath Server is both multi-project and also multi-user. That means functions must be scoped and sandboxed. They are and they work great! &lt;/p&gt;

&lt;p&gt;Let's create a trivial example to show the setup of a custom function. I'll leave the actual functionality as an exercise for the reader, since that part is demonstrated copiously in the &lt;a href="https://github.com/csvpath/csvpath" rel="noopener noreferrer"&gt;CsvPath Framework Github repo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The goal
&lt;/h2&gt;

&lt;p&gt;Our goal is to create a function called &lt;code&gt;sure()&lt;/code&gt;. It will functionally be the same as &lt;code&gt;yes()&lt;/code&gt;. I.e. it returns &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Our csvpath statement will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    $[*][ sure() ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you try that in FlightPath Data you will get this error message:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuf5lufn5e7jd9qyqd66.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuf5lufn5e7jd9qyqd66.png" alt=" " width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The config.ini
&lt;/h2&gt;

&lt;p&gt;The first step is to point to a functions import file. By default, import files are called &lt;code&gt;functions.imports&lt;/code&gt; and live in the project's &lt;code&gt;config&lt;/code&gt; directory. In FlightPath Data, click &lt;code&gt;Open config&lt;/code&gt; at the bottom left of the app to open the config panel. Then click &lt;code&gt;functions&lt;/code&gt; in the vertical tabs to open the functions config form. The form has just one field for the path to the imports file. The path can be relative or absolute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0x7hp5ihznt6u4qx7l9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0x7hp5ihznt6u4qx7l9.png" alt=" " width="800" height="279"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Once the path is ready click &lt;code&gt;Save and reload&lt;/code&gt; and then &lt;code&gt;Close config&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Next we need to edit the imports file to include our &lt;code&gt;sure()&lt;/code&gt; function. Right click on the blank space in the project files tree and select &lt;code&gt;Open project directory&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4i2kvoh5nxrov6lqw0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4i2kvoh5nxrov6lqw0a.png" alt=" " width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open the config directory and you should see three files, &lt;code&gt;config.ini&lt;/code&gt;, &lt;code&gt;env.json&lt;/code&gt;, and &lt;code&gt;functions.imports&lt;/code&gt;. If you don't see all three don't worry about it; some files are generated just in time. If there is no &lt;code&gt;functions.imports&lt;/code&gt; create one. Then open it.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;functions.imports&lt;/code&gt; we're going to add one line that imports our &lt;code&gt;sure()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoi83e3bzpwrimelucfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoi83e3bzpwrimelucfs.png" alt=" " width="800" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is basically the same form as Python's. It says find the &lt;code&gt;example/one/yes.py&lt;/code&gt; file and import the &lt;code&gt;Yes&lt;/code&gt; class, using the name &lt;code&gt;sure&lt;/code&gt; as the function name of the imported class. Simple!&lt;/p&gt;

&lt;p&gt;Finally we just need to put the custom function in the right place. The right place, starting from the project's home directory, is &lt;code&gt;config/example/one/yes.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I copied the regular &lt;code&gt;Yes&lt;/code&gt; class from its file in the repo to make my example &lt;code&gt;yes.py&lt;/code&gt;. Again, we're just setting up a custom function, not showing how to write an awesome function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqe8ltv62s385c5eb68k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqe8ltv62s385c5eb68k.png" alt=" " width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the &lt;code&gt;yes.py&lt;/code&gt; file lives.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnqlgbga5ksekjnxa4ek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnqlgbga5ksekjnxa4ek.png" alt=" " width="800" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And... we're basically ready to go. However, if you ran a csvpath already, restart FlightPath to clear the function classes that were already loaded. You can do this programmatically in CsvPath Framework, but there's no button in FlightPath's config yet.&lt;/p&gt;

&lt;p&gt;That done, back in FlightPath Data right click in your project files tree and create a new file. Call it &lt;code&gt;test.csvpaths&lt;/code&gt;, or whatever you like.&lt;/p&gt;

&lt;p&gt;In it, paste our test csvpath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    $[*][ sure() ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it look like: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0otvx1lohznlrg68rabf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0otvx1lohznlrg68rabf.png" alt=" " width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, with your cursor inside the csvpath statement, click &lt;code&gt;cmd-r&lt;/code&gt; (or &lt;code&gt;ctrl-r&lt;/code&gt; on Windows). &lt;/p&gt;

&lt;p&gt;You should see the message &lt;code&gt;Test run complete. Matched 2 lines.&lt;/code&gt; in the status bar and the printouts tab should be blank. (If your status bar says &lt;code&gt;Test run complete&lt;/code&gt; but has a different number of lines don't worry about it; your test data and mine just aren't the same).&lt;/p&gt;

&lt;p&gt;For a bit of comfort that all is working as expected, add a print line like this and you should get the same printouts shown.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu4yytt4rs8wq54y0mt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu4yytt4rs8wq54y0mt2.png" alt=" " width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that's all there is to it. Now you'll never be stumped by the absence of &lt;em&gt;Cool Function X&lt;/em&gt;, because you can write it yourself. &lt;/p&gt;

&lt;p&gt;To be fair, though, while a simple function can be trivial, as we just saw,  more complex functions can be... well, more complex. If you need help creating your awesome function don't hesitate to reach out. We'll be glad to help you get started.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>qc</category>
    </item>
    <item>
      <title>Comparing Validatar to CsvPath Validation</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Wed, 28 Jan 2026 02:33:38 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/comparing-validatar-to-csvpath-validation-1m4o</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/comparing-validatar-to-csvpath-validation-1m4o</guid>
      <description>&lt;p&gt;Let's go back to the buffet and compare &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; and &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt; to another validation tool. Today we'll look at &lt;a href="https://www.validatar.com/" rel="noopener noreferrer"&gt;Validatar&lt;/a&gt;. Once more into the breach dear friends! &lt;/p&gt;

&lt;p&gt;As with the other comparisons, please remember that data quality tools like &lt;a href="https://dev.to/david_kershaw_b6916404da6/comparison-csvpath-vis-a-vis-sodacl-2669"&gt;SodaCL&lt;/a&gt;, &lt;a href="https://dev.to/david_kershaw_b6916404da6/comparing-great-expectations-and-csvpath-framework-3lfg"&gt;Great Expectations&lt;/a&gt;, or today's contestant, Validatar, only do data quality. CsvPath Framework, by contrast, is a data-file feeds management infrastructure that covers data validation as just one aspect of the &lt;strong&gt;full data preboarding lifecycle&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Moreover, CsvPath Framework does not deal with relational databases (other than as an option for storing its own metadata). Validatar et. al., are first and foremost relational database quality management tools, and only secondarily deal with data files. So it's a mismatch, to some degree, but useful and entertaining nonetheless.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Validatar Example&lt;/li&gt;
&lt;li&gt;The CsvPath Way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Validatar example we'll replicate using CsvPath Validation Language and FlightPath Data is at &lt;a href="https://docs.validatar.com/docs/exercise-10-create-a-uniqueness-monitoring-template-for-csv-files" rel="noopener noreferrer"&gt;https://docs.validatar.com/docs/exercise-10-create-a-uniqueness-monitoring-template-for-csv-files&lt;/a&gt;. Just from the name of the exercise you know this is more up CsvPath Framework's alley than it is Validatar's. &lt;/p&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Validatar Example
&lt;/h1&gt;

&lt;p&gt;Here's the problem description at the top of the Validatar example: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This Standard Test is designed to demonstrate the concept of how to create a uniqueness template test for all CSV files in multiple folders.&lt;br&gt;
The Standard Test here compares the Row Count per &lt;code&gt;account_id&lt;/code&gt; value in the &lt;code&gt;account_data.csv&lt;/code&gt; to make sure all &lt;code&gt;account_id&lt;/code&gt;'s only have 1 row. The test only keeps failures and stops after 100 failure records.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Spoiler alert: in FlightPath this is a trivial example (as I &lt;em&gt;think&lt;/em&gt; it is &lt;em&gt;meant&lt;/em&gt; to be)&lt;/p&gt;

&lt;p&gt;Validatar starts by having you create a test template. Before you can do that, though, you need a project. Here are the instructions for that step: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose your project&lt;/li&gt;
&lt;li&gt;Make sure your Data Source is Mapped correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first bullet sounds simple. I'm not sure what the second bullet means because I'm not a Validatar expert and that one isn't explained on that page.&lt;/p&gt;

&lt;p&gt;Moving on, let's create that template. Most of this exercise is forms based. The setup is shown in the image below. Their ask is that you notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Note that the column specified to group by is account_id&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Note that it is comparing the ROW_COUNT to a fixed value of 1&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Note that the Result Configuration is set so that only Failures are kept and to abort after 100 failures are found&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good requirements for us to use on the CsvPath side. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ipx2l8stt9rkr2a8l93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ipx2l8stt9rkr2a8l93.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ssus8b8d102fsq3ygnh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ssus8b8d102fsq3ygnh.png" alt=" " width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point we have our test. Now we need to create a template from it so that we can apply it to each CSV file. This is how we get to a single action we can apply to multiple files in a uniform way. &lt;/p&gt;

&lt;p&gt;I'm going to just add the bullets because the screenshot is in the link above, which is of course a more complete description. We do: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click Build Template&lt;/li&gt;
&lt;li&gt;Update the Folder input to &lt;code&gt;{{schema.name}}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Update the File input to &lt;code&gt;{{table.name}}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Update the Column input to &lt;code&gt;{{#replace table.name "_data.csv" "_id"}}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Update the Metadata Links to &lt;code&gt;{{schema.name}}.{{table..name}}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Change the Generate column list using to Dynamic Template Configuration&lt;/li&gt;
&lt;li&gt;Update the Dynamic Script&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dynamic script is pretty simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    [
    {"name":"{{#replace table.name "_data.csv" "_id"}}","sequence":1,"type":"Numeric","role":"Key"},
        {"name":"ROW_COUNT","sequence":2,"type":"Numeric","role":"Value"}
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we're going to use some metadata to filter down to the files we care about. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch to the Metadata Selection Tab&lt;/li&gt;
&lt;li&gt;Change to the &lt;code&gt;Use Filters&lt;/code&gt; option&lt;/li&gt;
&lt;li&gt;Add a Filter on the Table Name Field that contains "_data.csv"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb2at9j5j1mpuso8xnoq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb2at9j5j1mpuso8xnoq.png" alt=" " width="629" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, check that the filter finds your files and run the example. You should be good to go. My feeling is that all works better for database tables than for CSV files, just as you would expect from Validatar.&lt;/p&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Now, CsvPath Framework
&lt;/h1&gt;

&lt;p&gt;Once more, this time with feeling! Let's see how CsvPath Framework and FlightPath Data can make the same magic happen. And, hopefully you'll agree that it's much simpler and more powerful for its use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  The requirements, again
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Create a uniqueness test for all csv files in multiple folders&lt;/li&gt;
&lt;li&gt;The column specified to group by is account_id&lt;/li&gt;
&lt;li&gt;Only keep failures&lt;/li&gt;
&lt;li&gt;Stop after 100 failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core of these requirements is the validation statement. Using CsvPath Validation Language this is next to trivial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][ 
    @duplicate_accounts.nocontrib == 100 -&amp;gt; stop()
    has_dups(#account_id) -&amp;gt; counter.duplicate_accounts(1) 
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(The &lt;code&gt;@&lt;/code&gt; sign means a variable and the &lt;code&gt;#&lt;/code&gt; sign indicates a header name)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This csvpath says: for each line in a file check if the counter is &lt;code&gt;100&lt;/code&gt;. If it is, stop processing that file. Otherwise, increase the counter if the &lt;code&gt;#account_id&lt;/code&gt; is a duplicate.&lt;/p&gt;

&lt;p&gt;The statement will collect only error lines because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The counter is a side-effect with no contribution to matching&lt;/li&gt;
&lt;li&gt;The check if &lt;code&gt;@duplicate_lines&lt;/code&gt; equals &lt;code&gt;100&lt;/code&gt; is marked to not contribute to matching. (Using the &lt;code&gt;nocontrib&lt;/code&gt; qualifier)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The function that does the heavy lifting is &lt;code&gt;has_dups()&lt;/code&gt;. If that returns &lt;code&gt;True&lt;/code&gt; (i.e. the value of &lt;code&gt;true()&lt;/code&gt;) we match the line and capture it.&lt;/p&gt;

&lt;p&gt;All pretty readable. Now what do we do with it?&lt;/p&gt;

&lt;h2&gt;
  
  
  FlightPath Data FTW
&lt;/h2&gt;

&lt;p&gt;All of what we need to do is almost as simple in Python using only CsvPath Framework. Almost! But using FlightPath Data it is even simpler.&lt;/p&gt;

&lt;p&gt;In FlightPath, create a new file called &lt;code&gt;dups.csvpath&lt;/code&gt;. Paste in our statement. &lt;/p&gt;

&lt;p&gt;Right-click on &lt;code&gt;dups.csvpath&lt;/code&gt; and select &lt;code&gt;Load csvpaths&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b3cvfjaqiqka81kliab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b3cvfjaqiqka81kliab.png" alt=" " width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the load dialog give the named-paths group the name &lt;code&gt;dups&lt;/code&gt; and click &lt;code&gt;Create&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futrfjh8swlcrhd38zkj1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futrfjh8swlcrhd38zkj1.png" alt=" " width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should see your csvpath show up in the middle window on the right under the &lt;code&gt;dups&lt;/code&gt; folder. When you load a csvpath statement it always goes into a &lt;code&gt;group.csvpaths&lt;/code&gt; file. And when you click on that file its background is pale green to let you know you cannot edit it. (You can, of course, over-write it anytime without losing prior versions, but that is another topic for a different post.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdo57rksvo0yg2fslwv5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdo57rksvo0yg2fslwv5a.png" alt=" " width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next stage your data. In the example, each file is in its own folder and its folder is one of many in the same directory. We'll just add the parent folder and let FlightPath find the files for us. &lt;/p&gt;

&lt;p&gt;To do that right click the parent directory and select &lt;code&gt;Stage data&lt;/code&gt;. In the stage data dialog uncheck the &lt;code&gt;Separate named-files&lt;/code&gt; checkbox because we're going to have every physical file be one version of the same named-file. Think of a named-file as a category that has one file assigned to it at a time, in sequence. We say named-files have versions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjgzd6q8p35lyulot712.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjgzd6q8p35lyulot712.png" alt=" " width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the named-file name box type &lt;code&gt;accounts&lt;/code&gt;. That's our category. You will see your data in the top-right window as a directory named &lt;code&gt;accounts&lt;/code&gt;. I used a template of :6/:filename in order to keep the month folders, but that is completely optional.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hgmadwxv9aqqm9nmgt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hgmadwxv9aqqm9nmgt2.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result looks like: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp92dy8ff4mom1m0dmxp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp92dy8ff4mom1m0dmxp.png" alt=" " width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, right click the &lt;code&gt;accounts&lt;/code&gt; folder, or the &lt;code&gt;dups&lt;/code&gt; folder below it, and select &lt;code&gt;New run&lt;/code&gt;. In the run dialog, for named-paths select &lt;code&gt;dups&lt;/code&gt;. For named-file type in &lt;code&gt;$accounts.files.:all&lt;/code&gt;. That named-file name is a reference that indicates every version of the &lt;code&gt;accounts&lt;/code&gt; named-file. Again remember, the named-file is like a category that registers a file at a time. We registered a bunch of files and now we're applying our CsvPath Validation Language statement to each of them in turn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryt8ufxgxzny952uxqj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryt8ufxgxzny952uxqj7.png" alt=" " width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here's the Run dialog: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4455u6kbmbkbta4qypwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4455u6kbmbkbta4qypwh.png" alt=" " width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you click &lt;code&gt;Run&lt;/code&gt; you will see your results in the lower right-hand window. Your run is date stamped within the &lt;code&gt;dups&lt;/code&gt; results. In your date-stamped run you can see the &lt;code&gt;data.csv&lt;/code&gt; where your duplicate lines landed. In this image I dropped each run into its own folder using a template; you can see the &lt;code&gt;test&lt;/code&gt; and &lt;code&gt;test2&lt;/code&gt;. That is completely optional, of course.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbtbdho199o0b0ofsock.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbtbdho199o0b0ofsock.png" alt=" " width="718" height="1910"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that is it!&lt;/p&gt;

&lt;p&gt;There is, of course, much more you can do with CsvPath Framework. Likewise, Validatar has a ton more functionality than what we showed. But now you've had a taste of both. &lt;/p&gt;

&lt;p&gt;What I'd hope you come away with is that CsvPath Framework is the better tool for CSV, JSONL, and Excel file validation. The ease of using &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt; for this validation example makes the case well. Obviously, for relational database validation, Validatar is your horse.&lt;/p&gt;

&lt;p&gt;And of course I also want to point out again that &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; is a complete data preboarding solution, not just a validation engine. Preboarding inbound data files is a big deal. If you need that (and who doesn't?) you owe it to yourself to take a look at CsvPath Framework.&lt;/p&gt;

&lt;p&gt;To whet your appetite, here's a post on &lt;a href="https://blog.csvpath.org/should-you-build-or-buy-data-preboarding" rel="noopener noreferrer"&gt;data preboarding build or buy&lt;/a&gt;. Enjoy! &lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>JSONL is a seriously weird format!</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Thu, 15 Jan 2026 02:56:45 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/jsonl-is-a-seriously-weird-format-5flp</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/jsonl-is-a-seriously-weird-format-5flp</guid>
      <description>&lt;p&gt;&lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; has had JSONL support for a while. It wasn't until we updated &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt; to edit JSONL that I really started noticing how seriously free-wheeling JSONL is. At least in the context of a certain kind of developers tool, JSONL is even weirder than CSV. &lt;/p&gt;

&lt;h1&gt;
  
  
  What Wouldn't JSONL Do?
&lt;/h1&gt;

&lt;p&gt;Here are a few of the quirks we pondered: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no header order&lt;/li&gt;
&lt;li&gt;Headers change line-by-line...&lt;/li&gt;
&lt;li&gt;...or is there one total set of headers that grows line-by-line?&lt;/li&gt;
&lt;li&gt;Arrays and dictionary lines can cohabitate&lt;/li&gt;
&lt;li&gt;Header values (using CsvPath terms; a.k.a. dictionary values) can be complex objects&lt;/li&gt;
&lt;li&gt;Array lines have headers that equal their values...&lt;/li&gt;
&lt;li&gt;...or do array lines just not have headers?&lt;/li&gt;
&lt;li&gt;...or are array indexes associated with the most recent headers?&lt;/li&gt;
&lt;li&gt;With all this header (a.k.a. column) uncertainty you may have to read the whole file to know what the shape of the data is so that you have a reference point for when you check the lines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this and more! &lt;/p&gt;

&lt;h1&gt;
  
  
  The Visualization Is Where They Get You
&lt;/h1&gt;

&lt;p&gt;And it only gets more nuanced and nutty when you try to make a grid view editor. A simple editor has no useful way to show dictionary-line data. The compromises are a big deal. You can choose to make the headers an ever growing set and then sort them and display the data in the grid according to the sorted header name position, thereby allowing you to serialize the data back to JSONL. But that is a lot of choices that add up to a very particular way of handling the data that might not work for everyone.&lt;/p&gt;

&lt;p&gt;We split the difference. FlightPath shows JSONL data in a grid view. However, it does not attempt to show any headers. And if you edit the data in the grid view you have to save it as CSV. Since that's a lot of compromise, we made it possible to use the JSON editing view for JSONL. That gives you syntax highlighting, pretty printing (JSONL style!), and full control of the data, but at the cost of the greater productivity you would have had in a grid view.&lt;/p&gt;

&lt;h1&gt;
  
  
  So Why JSONL?
&lt;/h1&gt;

&lt;p&gt;JSONL is great for data that is less tabular or fluctuates even more wildly than CSV does. It gives cleaner programmatic access to the contents of each line. In most cases, JSONL's key based access is both self-documenting at a line level, albeit leaving it a bit harder to understand the file as a whole, and keeps you from having to access data through indexes. And while it is more verbose for many purposes, JSONL can potentially save significant space in certain wide and sparse use cases. &lt;/p&gt;

&lt;p&gt;All that said, if you're a stickler for correctness, control, and referenceable validity, JSONL might give you hives. Even more than CSV. And that's saying something.&lt;/p&gt;

&lt;p&gt;If you're feeling the need for an antidote to crazy JSONL, CSV, and Excel data, checkout &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; and &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>data</category>
      <category>programming</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Is CsvPath an easy or hard language?</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Sun, 14 Dec 2025 05:34:54 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/is-csvpath-an-easy-or-hard-language-1g8l</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/is-csvpath-an-easy-or-hard-language-1g8l</guid>
      <description>&lt;h2&gt;
  
  
  CsvPath. What? Why? How?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; includes CsvPath Validation Language. To save scarce ASCII, let's call it CVL. &lt;/p&gt;

&lt;p&gt;CVL is a document-oriented tabular data validation language like XSD, Schematron, DDL, and JSONQuery. You can use CVL to validate CSV, Excel, and JSONL files, as well as Pandas-type data frames in memory. The purpose of CVL within CsvPath Framework is to be the validation and upgrading capability within the larger data preboarding architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/csvpath/csvpath" rel="noopener noreferrer"&gt;CsvPath Validation Language&lt;/a&gt; is line-based. That means it validates streaming line-by-line, rather than looking at a complete set of data all at once. All of the other languages referenced above are full-set-at-a-time languages, not line-by-line. That means that, in principle, they  trade memory size for speed. In practice the picture is more nuanced. DDL in particular typically has strategies that allow it to handle data larger than RAM. CVL being line-by-line also means it can do some things that the full data set languages cannot (easily) do and it cannot do some things (easily) that they can do. In both cases the runtimes they run within can compensate for limitations, at least to some degree.&lt;/p&gt;

&lt;p&gt;Also worth pointing out, CVL and all of the others mentioned, are declarative languages. By and large, you don't program them to follow a logical algorithm determined by you. Rather you declare what should happen or what you are testing and let the runtime figure out how to give you the desired results.&lt;/p&gt;

&lt;p&gt;CVL is both schema-based and rules-based. Its schema definition capabilities are most like SQL's DDL. Its rules capabilities most like Schematron and SQL's queries. CVL is an intentionally simplistic language, at least in grammatical terms. However, that is like saying the Chinese language is grammatically simple. Yes, but have you seen the characters? &lt;/p&gt;

&lt;p&gt;In CVL's case, while the challenge isn't at the magnitude of tens of thousands of Chinese characters, the set of functions is still substantial. CVL has hundreds of functions that can be used to create rules that go far beyond schemas into business logic validation. Knowing which functions are the best ones to use for a particular purpose is perhaps the biggest challenge. And on top of that are a host of qualifiers, modes, and other power toys.&lt;/p&gt;

&lt;p&gt;Why do we need a tabular data validation language? Because tabular data, CSV especially, is widespread, is way too flexible in ways that hurt productivity and add bottom-line risk, and is frequently under the control of external data partners, not internal staff. Moreover, the lowest cost point, by far, to find and fix data problems is as close the source as possible. Every step data takes inward from the enterprise's edge magnifies the blast radius of errors dramatically. And every time you need to rewind a process or replay a transformation, your cost and risk go up like skyrockets. 🧨&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Question
&lt;/h2&gt;

&lt;p&gt;So now we have a sense for what CsvPath Validation Language is and why it is important. A practical next question is what it is going to cost me in time and toil to learn to use it? &lt;/p&gt;

&lt;p&gt;The answer is relative. If your goal is to validate the data types of a data frame now and then, CVL is probably too much trouble compared to hacking up a Python script, especially if you are already comfortable with Pandas or Polars. CVL was not made for one-off validations, although it can certainly do them well enough. &lt;/p&gt;

&lt;p&gt;But if your goal is to run a data collection operation involving tens, hundreds, or thousands of data partners and document types, CVL and CsvPath Framework are vital. Compared to the challenges of: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;designing and building your own framework or &lt;/li&gt;
&lt;li&gt;handling every partnership as a one-off unique project, &lt;/li&gt;
&lt;li&gt;of managing the codebase to control drift and tech debt, &lt;/li&gt;
&lt;li&gt;of maintaining knowledge as team members change over time, or &lt;/li&gt;
&lt;li&gt;compensating for other languages' lack of applicability to the tabular data file problems you face, or&lt;/li&gt;
&lt;li&gt;of firefighting constantly with one arm tied behind your back, &lt;/li&gt;
&lt;li&gt;etc., etc. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to all that, CVL is an absolute walk in the park. &lt;/p&gt;

&lt;p&gt;We would argue that CVL is an easy language to use, easy to learn, capable of most situations it was built for, and only about as capable of helping you shoot yourself in the foot as SQL. Which is to say, it's not hard to blow your foot off, but it's quite doable to just not. And, and this is important, CsvPath Validation Language is built for automation, not ad hoc scripting. We expect that you'll treat CVL as code, use iterative dev and tests, create it methodically with an eye on simplicity and reuse, version it, etc. CVL will treat you as well as you treat it, like any other serious language.&lt;/p&gt;

&lt;p&gt;So, after all that, is CVL easy to learn and use? It depends on your perspective. &lt;/p&gt;

&lt;h2&gt;
  
  
  A (Relatively) Simple Example
&lt;/h2&gt;

&lt;p&gt;I'm writing this because I ran into some CVL that was quite straightforward, but also had a lot of small things to remember. That's the &lt;em&gt;"it's easy if you know it"&lt;/em&gt; side of so many things.&lt;/p&gt;

&lt;p&gt;Here's the scenario, it may be familiar to some of you.&lt;/p&gt;

&lt;p&gt;We have a data partner in the retail space who sends us a CSV of orders on a weekly or monthly basis. We want to validate files as the enter the enterprise, before they are accepted as being basically good data. (Downstream our application may reject some orders, but its more exacting business rules are not pertinent to preboarding raw data as it arrives; we're not trying to reinvent the app, just protect it from crap).&lt;/p&gt;

&lt;p&gt;Here is the sample data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ID,date,time,store,address,city,state,zip,category,type,shelf,vendor,product name,UPC,SKU,unit,quantity,a price
03358993,03/21/2024,10:24:14,Bob's store,1 Lakeshore Drive, Chicago, IL, 33581,OFFICE,PAPER,1-5,Sams Paper,20lbs Ream,0301024855,,per each,8,20.99
03358994,03/21/2024,10:31:28,Fred's store,1 Lakeshore Drive, Chicago, IL, 33581,OPERA,WRITING,1-5,Biz Pen,10-pack Black,0541931855,0432950078,per each,2,4
03358995,03/21/2024,11:26:18,Mary's story, 1 Lakeshore Drive, Chicago, IL, 33581,FOOD,CANDY,7,Starbursts,Single,3583900656,0899920453,per each,1,1.29d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to catch several things here using rules. Creating a schema would be a different exercise than what I was doing. Here we want to catch: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad prices&lt;/li&gt;
&lt;li&gt;Wrong categories&lt;/li&gt;
&lt;li&gt;Missing SKU (stock-keeping units)&lt;/li&gt;
&lt;li&gt;Missing UPC (universal product codes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the exercises on &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;https://www.csvpath.org&lt;/a&gt;, but simplified a bit.&lt;/p&gt;

&lt;p&gt;The way I tackled these rules was to create this csvpath. In the real world, I probably would have chosen to create multiple csvpaths, 1 per rule, and run them as a named-paths group. But here I just whipped up this one-off csvpath. You can tell it's a one-off because it includes the path to a file in its root. I.e. &lt;code&gt;${FILE}&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~
  validation-mode:no-raise, no-stop, print, no-fail, collect
  logic-mode:OR
~
${FILE}[1*][

~ 1x wrong, 2nd item ~
@in = in( #category, "OFFICE|COMPUTING|FURNITURE|PRINT|FOOD|OTHER" )
not( @in.asbool ) -&amp;gt;
    error.category( "Bad category $.headers.category at line $.csvpath.count_lines ", fail())

~ 2x wrong, 2nd and 3rd items ~
@price_format = exact( end(), /\\$?\\d*\\.\\d{{2}}/ )
not( @price_format.asbool ) -&amp;gt;
    error.price("Bad price $.headers.'a price' at line  $.csvpath.count_lines", fail())

~ 1x missing, 1st item ~
#SKU
not( #SKU ) -&amp;gt;
    error.sku("No SKU at line $.csvpath.count_lines", fail())

~ always exists ~
#UPC
not( #UPC ) -&amp;gt;
    error.upc("No UPC at line $.csvpath.count_lines", fail())
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are the things I found myself jotting down as I went. They are not all-new or unfamiliar. At first I just wanted to remember the changes I was making. Then I started thinking about what a new CVL user would have to know.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;THINGS TO PAY ATTENTION TO:
* using OR logic
* the importance of validation-mode
* scanning instructions and headers
* variable existence tests and the asbool qualifier
* side-effects vs. match determiners
* assignments don't determine matches
* header existence tests
* add names to errors, at least here (how id_chains work)
* what is fail()?
* use " on headers in body, use ' on headers in print() and error()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  10 things
&lt;/h2&gt;

&lt;p&gt;Yeah, it's a lot. But does that make CVL hard? Is the sum of the parts powerful? Again, depends on your point of view. I'd say not super hard, but it takes practice, same as any language. And, yes, powerful, if you need it; otherwise, it could be overkill, as I said above. &lt;/p&gt;

&lt;p&gt;Let's go through the items.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using OR logic
&lt;/h3&gt;

&lt;p&gt;The example uses logic-mode to switch from the default ANDing to ORing. Modes are ways of configuring a single csvpath to behave in a certain way without changing your CsvPath Framework project's global config. You set modes in comments at the top of a csvpath. Comments are demarcated with &lt;code&gt;~&lt;/code&gt; characters. &lt;/p&gt;

&lt;p&gt;AND is the default. That means that within the "matching part" of a csvpath each declared unit, called a match component, is ANDed together to determine if a line matches the csvpath statement. (&lt;a href="https://en.wikipedia.org/wiki/Boolean_algebra" rel="noopener noreferrer"&gt;Read about AND and OR here&lt;/a&gt;). That means all match components must evaluate to true (in Python terms &lt;code&gt;True&lt;/code&gt;) for the line to match.&lt;/p&gt;

&lt;p&gt;In this case, we're going to use OR by setting &lt;code&gt;logic-mode: OR&lt;/code&gt;. The decision is somewhat arbitrary for the exercise. We could have used AND with only small changes to the csvpath, but the starting point for my work on this example was already &lt;code&gt;OR&lt;/code&gt; so I left it. &lt;/p&gt;

&lt;p&gt;OR means that each match component has the opportunity to determine if a match. In AND all must be true. In OR all it takes is one true for the line to match. This means we're using an alternate validation strategy. In OR we're saying &lt;em&gt;"good looks like one of these things"&lt;/em&gt;; whereas, in AND we're saying &lt;em&gt;"good requires all of these things"&lt;/em&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  The importance of &lt;code&gt;validation-mode&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Another mode, &lt;code&gt;validation-mode&lt;/code&gt;, determines what happens when we hit an error. &lt;code&gt;validation-mode&lt;/code&gt; controls the action that happens when the  Framework runs into a validation or language error. A validation error is a data error. It may be your own custom rule or it may be built-in. A language error is when you try to do something that is syntactically wrong or uses a match component incorrectly.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;validation-mode&lt;/code&gt; has the following options when an error happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raise an exception, resulting in the run terminating noisily&lt;/li&gt;
&lt;li&gt;Print the error to the default printstream (or you can choose another printstream if the error is from your own custom rule)&lt;/li&gt;
&lt;li&gt;Stop the run cleanly&lt;/li&gt;
&lt;li&gt;Fail the run, but keep going&lt;/li&gt;
&lt;li&gt;Collect the error for later programmatic review and/or in a JSON file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can negate these options by adding &lt;code&gt;no-&lt;/code&gt; as a prefix to the option. E.g. &lt;code&gt;no-raise&lt;/code&gt;. And you can select multiple options at once by separating them with a comma. E.g.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   validation-mode:no-raise, print, fail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important things to remember about validation-mode is that &lt;code&gt;raise&lt;/code&gt; will halt runs and print stack traces; whereas, &lt;code&gt;no-raise&lt;/code&gt; will keep going and errors may be harder to detect. &lt;code&gt;no-raise&lt;/code&gt; + &lt;code&gt;no-print&lt;/code&gt; is generally a bad option, at least in dev, because you will have trouble telling what happened when you hit an error. The second most important thing is that if &lt;code&gt;print&lt;/code&gt; is turned off project wide, you lose a lot of information, and in that case using validation mode to turn print on is often a good idea. However, printing is expensive. When you do a lot of printing the run slows down considerably, so if you expect a lot of errors you may want to just collect them silently and review that JSON after the run, rather than printing a ton of lines that may slow a large file down too much.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scanning instructions and headers
&lt;/h3&gt;

&lt;p&gt;This is an easy one. The "scanning part" of a csvpath is the first set of brackets. Often it looks like: &lt;code&gt;[*]&lt;/code&gt;. Scanning instructions can pick out certain lines to validate and skip over other lines. One of the most common use cases is to use &lt;code&gt;[1*]&lt;/code&gt; to say that we want to scan from line 1 to the end of the file. This means the 2nd line to the end, because lines numbers are 0-based. What this does is skip over the header line. The header line is, initially, always line 0. You can reset the headers any time, but that first line is the typical location. &lt;/p&gt;

&lt;p&gt;If you know you have a header line you probably don't want to validate it. Let's parse that statement. You may want to validate that the headers are, say, in a certain order, all-caps, from a certain list of names, etc. But you don't have to do that one the 0th line. You can declare those rules and they will take affect, just as long as any one or more lines are considered for matching. At the same time, while your header rules won't trouble other data lines, the reverse may not be true. If you expect the &lt;code&gt;item count&lt;/code&gt; header to have an integer that you want to sum, you probably don't want to see an error on line 0 because &lt;code&gt;item count&lt;/code&gt; is not an integer.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Variable existence tests vs the &lt;code&gt;asbool&lt;/code&gt; qualifier
&lt;/h3&gt;

&lt;p&gt;When you have a variable all by its self it is a test for the existence of the variable. Variables are words with an &lt;code&gt;@&lt;/code&gt; in front, &lt;code&gt;@firstname&lt;/code&gt;. A variable is created the first time it is used. But saying &lt;code&gt;@firstname&lt;/code&gt; by itself doesn't create the variable, rather it returns true if &lt;code&gt;@firstname&lt;/code&gt; exists, otherwise false. &lt;/p&gt;

&lt;p&gt;This can be confusing if you have a variable with a boolean value where you want the variable to vote on the line matching. We have that in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    not( @price_format.asbool )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the price format is &lt;code&gt;False&lt;/code&gt; we want to do something. But, if we just do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @price_format
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We aren't voting to match based on the correctness of the price format. We're instead voting to match based on the existence of the price format variable. In order to vote with the value of the variable we have you use the &lt;code&gt;asbool&lt;/code&gt; qualifier. A qualifier modifies how a match component behaves. In this case we're saying "pay attention to my value, not to if I exist".&lt;/p&gt;

&lt;h3&gt;
  
  
  Side-effects vs. match determiners
&lt;/h3&gt;

&lt;p&gt;At a high level, a match component can do one of three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determine if lines match (a.k.a. vote on matching)&lt;/li&gt;
&lt;li&gt;Return a calculated value&lt;/li&gt;
&lt;li&gt;Do some other thing that doesn't provide a vote or a value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latter is called a side-effect. An obvious example is &lt;code&gt;print()&lt;/code&gt;. You can print without voting on a line matching or not. And &lt;code&gt;print()&lt;/code&gt; doesn't return a value that can be used in an assignment or existence test.&lt;/p&gt;

&lt;p&gt;An assignment is a side-effect. For e.g.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@in = in( #category, "OFFICE|COMPUTING|FURNITURE|PRINT|FOOD|OTHER" )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This match component is an assignment of the value produced by &lt;code&gt;in()&lt;/code&gt; to the variable &lt;code&gt;@in&lt;/code&gt;. You might think that either or both sides of a when/do expression would be the same. A when/do is an expression where a left-side match component is separated from a right-side match component by a &lt;code&gt;-&amp;gt;&lt;/code&gt; operator. When/do is CVL's if/then statement. This guess would be reasonable, but wrong. A when/do operation allows both sides to vote. That means that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    not( @price_format.asbool ) -&amp;gt;
    error.price("Bad price $.headers.'a price' at line  $.csvpath.count_lines", fail())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The left-hand side is voting to match or not based on the boolean value of the &lt;code&gt;@price_format&lt;/code&gt; variable. At the same time, &lt;code&gt;error.price()&lt;/code&gt; is not voting because an &lt;code&gt;error()&lt;/code&gt; is a side-effect. To be more specific, &lt;code&gt;not()&lt;/code&gt; both generates a value and also votes on matches; &lt;code&gt;error()&lt;/code&gt; does neither.&lt;/p&gt;

&lt;p&gt;With hundreds of functions, CVL demands a quick function lookup with detailed documentation. Luckily you get that in CsvPath Framework's CLI (just do &lt;code&gt;poetry run cli&lt;/code&gt;) and/or look in &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt; in the language helper window that opens on the right when you are editing a csvpath file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assignments don't determine matches
&lt;/h3&gt;

&lt;p&gt;We've just covered this point-to-remember above, but it bears repeating: an &lt;code&gt;@x = "true"&lt;/code&gt; expression does not vote on if a line matches. In this case we only set &lt;code&gt;@x&lt;/code&gt; to the value &lt;code&gt;"true"&lt;/code&gt;, nothing more. We can subsequently use &lt;code&gt;@x&lt;/code&gt; to vote using either its existence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   @x  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or its value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   @x.asbool 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To (hopefully!) be over-clear, this csvpath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   @x = "true" @x @x.asbool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;has 3 match components and will always match every line because not only does &lt;code&gt;@x&lt;/code&gt; exist, but &lt;code&gt;@x&lt;/code&gt; is also &lt;code&gt;true&lt;/code&gt;, which means &lt;code&gt;@x.asbool&lt;/code&gt; evaluates to a Python &lt;code&gt;True&lt;/code&gt; under the hood based on the &lt;code&gt;"true"&lt;/code&gt; value of &lt;code&gt;@x&lt;/code&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Header existence tests
&lt;/h3&gt;

&lt;p&gt;Similar to the point I just made above. A header existence test is like &lt;code&gt;#SKU&lt;/code&gt;. That means the &lt;code&gt;#SKU&lt;/code&gt; header evaluates to true when a line has a value for SKU; otherwise, false. In the example, we aren't checking what the value is, we only care that there is a value. If we needed the value of &lt;code&gt;#SKU&lt;/code&gt; to be treated as a boolean we would again use the &lt;code&gt;asbool&lt;/code&gt; qualifier, but in this case we do not need that. &lt;/p&gt;

&lt;p&gt;The example uses OR logic and we want to collect all the lines, valid or not, so I have both the rule predicate &lt;code&gt;not(#SKU)&lt;/code&gt; and the existence test &lt;code&gt;#SKU&lt;/code&gt;. The first determines if we get a validation error message and collects lines that are missing SKUs, and the second collects the line if there is a SKU, so that we get all the lines. &lt;/p&gt;

&lt;p&gt;An improvement to the example might be to remove the two match components &lt;code&gt;#SKU&lt;/code&gt; and &lt;code&gt;#UPC&lt;/code&gt; and instead add &lt;code&gt;yes()&lt;/code&gt;. &lt;code&gt;yes()&lt;/code&gt; always votes to match. In OR we only need one positive vote, so &lt;code&gt;yes()&lt;/code&gt; collects every line. The reason this would be an improvement is that the single &lt;code&gt;yes()&lt;/code&gt; replaces two match components, and fewer is generally better. And, because, as written, we collect all lines because of the SKU and UPC match components, but we don't follow the same pattern with price and category. It would be nice to use the same pattern for all four rules, even if either the SKU or UPC match components collect all the lines for everyone. Using a &lt;code&gt;yes()&lt;/code&gt; separates the line collection from the validation rules and makes all the rules behave the same, while at the same time simplifying the csvpath a little. &lt;/p&gt;

&lt;h3&gt;
  
  
  Add names to errors
&lt;/h3&gt;

&lt;p&gt;It's important to remember that &lt;code&gt;print()&lt;/code&gt; and &lt;code&gt;error()&lt;/code&gt; are essentially the same thing, with one super important difference &lt;code&gt;error()&lt;/code&gt; creates error events. Both functions print messages to printstreams, optionally using print references to include potentially detailed metadata. But only &lt;code&gt;error()&lt;/code&gt; throws off JSON error events that are captured to the &lt;code&gt;errors.json&lt;/code&gt; and available programmatically. And the error messages you create using &lt;code&gt;error()&lt;/code&gt; are printed in error-formatted lines.&lt;/p&gt;

&lt;p&gt;An error-formatted line is like log output that is formatted using a log template. You can set the error template in &lt;code&gt;config.ini&lt;/code&gt; or the config panel in &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt;. Your config can choose to print "bare" errors or "full". A bare error doesn't use the error template at all; instead, the message is the whole output to the printstream. The default error message template is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{time}:{file}:{line}:{paths}:{instance}:{chain}:  {message}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now comes the point. If you add names to your &lt;code&gt;error()&lt;/code&gt; functions, like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error.price("Bad price $.headers.price at line $.csvpath.count_lines")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get a full-format error message that has a clear ID chain. (Assuming your template includes the &lt;code&gt;{chain}&lt;/code&gt;. What does this mean?&lt;/p&gt;

&lt;p&gt;An ID chain is the path-like identifier of each match component, scoped within an expression. Each match component in your csvpaths is, potentially, composed of other match components in a tree structure. Each top-level match component is considered an expression. (The component structure is held in an Expression object, even if we rarely speak of it). &lt;code&gt;$[*][#a #b @c]&lt;/code&gt; is a csvpath with 3 expressions, each of which has one component. &lt;code&gt;$[*][add(#a,#b,@c)]&lt;/code&gt; is a csvpath with 1 expression that has 4 components.&lt;/p&gt;

&lt;p&gt;Within each expression the name-by-name path, parent to child, leading to a specific match component is its ID chain. In an error line the prefix before the message might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    2025-12-14 04h18m59s-936643:March-2024.csv:11:::category 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here &lt;code&gt;category&lt;/code&gt; is what we declared as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    not( @in.asbool ) -&amp;gt;
       error.category( "Bad category $.headers.category at line $.csvpath.count_lines ", fail())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, this simple example doesn't show the power of the ID chain. Let's make a contrived example that helps better show it. If we change the error category match component to nest another child match component, while removing the &lt;code&gt;category&lt;/code&gt; name on the &lt;code&gt;error()&lt;/code&gt;, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    not( @in.asbool ) -&amp;gt;
       error( "Bad category $.headers.category at line $.csvpath.count_lines ", error("just testing", fail()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error message we get from the nested error is this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    2025-12-14 04h20m04s-975094:March-2024.csv:11:::error[1].error[1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That says that the first error child of an error raised an error event. We can see what that means because it's a simple example. But since ID chains are always scoped to the expression, not the csvpath statement as a whole, what happens when we have very similar error checks for SKUs and UPCs? &lt;code&gt;error[1].error[1]&lt;/code&gt; doesn't feel so informative. And these rules are simple; in more complex situations ID chains are both more important and, potentially, harder to read.&lt;/p&gt;

&lt;p&gt;Just adding the names to the functions makes all the difference to interpreting where the error occurred. Let's change the expression to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    not( @in.asbool ) -&amp;gt;
       error.category( "Bad category $.headers.category at line $.csvpath.count_lines ", error.testing-idchains("just testing", fail()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error message of the 2nd &lt;code&gt;error()&lt;/code&gt; becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2025-12-14 04h33m20s-635277:March-2024.csv:11:::category.testing-idchains
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's much more readable! Now our UPC, SKU, price, and category errors are much clearer, making error investigation just a bit less frustrating and a bit more quick.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is &lt;code&gt;fail()&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;There are many approaches to validation using any competent validation language in any non-trivial use case. CsvPath Validation Language is no exception, and in fact, due to the quirks of CSV and tabular data file processing CVL has more options than most. Among these, the big ones are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect matching lines as the valid data&lt;/li&gt;
&lt;li&gt;Collect matching lines as the invalid data&lt;/li&gt;
&lt;li&gt;Print or &lt;code&gt;error()&lt;/code&gt; to indicate errors, without collecting data or making a ruling on the file as a whole&lt;/li&gt;
&lt;li&gt;Pass or file a data file as a whole based on what rules, schemas, or built-in validations passed or failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last option, marking the data file invalid based on rules and/or schemas, relies on the &lt;code&gt;fail()&lt;/code&gt; function and the settings for error handling in &lt;code&gt;config.ini&lt;/code&gt; and &lt;code&gt;validation-mode&lt;/code&gt; to set the run's &lt;code&gt;is_valid&lt;/code&gt; field. &lt;code&gt;is_valid&lt;/code&gt; is visible and/or accessible in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run metadata JSON&lt;/li&gt;
&lt;li&gt;Print and error function references (e.g. &lt;code&gt;$.csvpath.valid&lt;/code&gt;) &lt;/li&gt;
&lt;li&gt;The &lt;code&gt;valid()&lt;/code&gt; and &lt;code&gt;failed()&lt;/code&gt; functions &lt;/li&gt;
&lt;li&gt;Programmatically on the CsvPath and Result objects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depending on your config and &lt;code&gt;validation-mode&lt;/code&gt; setting, if any, built-in errors can automatically fail a run. In some cases, though, you may want to be more deliberate in failing your runs based on certain rules, but allowing others to just be informational. For instance, you might want to flag a date format error, but then simply upgrade the data, rather than failing the run for a very correctable problem. At the same time, if the date was empty but expected, you might want to use &lt;code&gt;fail()&lt;/code&gt; mark the run and data file invalid because you found an uncorrectable, unignorable problem. &lt;/p&gt;

&lt;h3&gt;
  
  
  Use &lt;code&gt;"&lt;/code&gt; on headers in body, use &lt;code&gt;'&lt;/code&gt; on headers in &lt;code&gt;print()&lt;/code&gt; and &lt;code&gt;error()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;And finally, an easy one! We use &lt;code&gt;#"my header name with spaces"&lt;/code&gt; to refer to a header named &lt;code&gt;my header name with spaces&lt;/code&gt;. At least, that's how we do it in the world of header match components. Ah, but the world of &lt;code&gt;print()&lt;/code&gt; and &lt;code&gt;error()&lt;/code&gt; messages is a bit different. You see the problem, right? A print or error message is wrapped in quotes, so it's not possible for a header reference to use quotes. &lt;/p&gt;

&lt;p&gt;Instead, within the print or error message we simply use single quotes for header names that need to be quoted. This really isn't a big ask. Here is our example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#"a price"
not( @price_format.asbool ) -&amp;gt;
    error.price("Bad price $.headers.'a price' at line  $.csvpath.count_lines", fail())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are testing if the &lt;code&gt;#"a price"&lt;/code&gt; header exists; presumably to make sure we match lines with prices. Then we're raising an error if the &lt;code&gt;@price_format&lt;/code&gt; variable is &lt;code&gt;False&lt;/code&gt;. In the &lt;code&gt;error()&lt;/code&gt; message we refer to the same header as &lt;code&gt;$.headers.'a price'&lt;/code&gt;. Not too much to ask. But it is one more one-little-thing to remember.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, What Is CsvPath Validation Language: Easy? Hard?
&lt;/h2&gt;

&lt;p&gt;I hope this post helps call out some of the knowledge that can help you create effective validation and upgrading scripts. There is, of course, much more to unpack. But the real question here isn't how much or how little, how hard or how easy. The real question is, &lt;em&gt;is it worth it&lt;/em&gt;? And the test of that for many people is: &lt;em&gt;does the solution make easy stuff easy and hard stuff possible?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By that measure, CsvPath Validation Language and the whole &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; scores well. &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>datastructures</category>
      <category>programming</category>
    </item>
    <item>
      <title>How Many Ways Can A CSV File Hold Multiple Entities?</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Sat, 13 Dec 2025 05:36:49 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/how-many-ways-can-a-csv-file-hold-multiple-entities-1i3i</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/how-many-ways-can-a-csv-file-hold-multiple-entities-1i3i</guid>
      <description>&lt;h2&gt;
  
  
  Focus On the Structure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphy21r8ncwni6xyhgmgd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphy21r8ncwni6xyhgmgd.png" alt=" " width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CSV, Excel, and JSONL need more validation love
&lt;/h2&gt;

&lt;p&gt;Tabular data can be handled with as much rigor, quality, and efficiency as document-form, containment models (e.g. JSONSchema, XSD) or tables based data (e.g. SQL). And it should be! &lt;/p&gt;

&lt;p&gt;As one example, &lt;a href="https://github.com/csvpath/csvpath" rel="noopener noreferrer"&gt;CsvPath Validation Language&lt;/a&gt; can accommodate multiple entities living in a single tabular document using a sophisticated schema syntax as capable as DDL. In fact, it can do this at least four different ways. Each of them common enough in the wild.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mixed parent-child relationships by line adjacency and order&lt;/li&gt;
&lt;li&gt;Multiple entities' data lines grouped one entity after another&lt;/li&gt;
&lt;li&gt;Entities side-by-side, line-by-line&lt;/li&gt;
&lt;li&gt;Entities in sub-table like clusters organized visually and floating in a tabular landscape (e.g. Excel files with ancillary table/boxes for sub-calculations, dimensions, etc.) &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The image above outlines each of these. Can you think of more ways to position multiple entities in a tabular CSV, Excel, or JSONL data file?&lt;/p&gt;

&lt;p&gt;Have a look at &lt;a href="https://blog.csvpath.org/do-schemas-have-a-place-in-delimited-data" rel="noopener noreferrer"&gt;this article on CSV validation schemas&lt;/a&gt; to see the syntax and some approaches used in the validation part of &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework's data preboarding architecture&lt;/a&gt;.   &lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>programming</category>
      <category>csv</category>
    </item>
    <item>
      <title>SQL: Doing GROUP BY in CsvPath</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Thu, 04 Dec 2025 01:28:14 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/sql-doing-group-by-in-csvpath-2ff1</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/sql-doing-group-by-in-csvpath-2ff1</guid>
      <description>&lt;p&gt;Let's look at how to create a simple &lt;code&gt;GROUP BY&lt;/code&gt; report in &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt;'s tabular data validation language. Of all our examples, this is an easy one!&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;GROUP BY&lt;/code&gt; query is straightforward. It selects rows and groups them according to one or more columns. The archetypal example is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT  
  dept, 
  role, 
  SUM(salary) total_salary
FROM    employee
GROUP BY dept, role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query produces a result set with three columns. The first two are a unique combination of the &lt;code&gt;dept&lt;/code&gt; and &lt;code&gt;role&lt;/code&gt; columns and the third is a summation of the salaries in the rows with that combination of department and role.&lt;/p&gt;

&lt;p&gt;For our example we'll use &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt;'s built-in examples because they are handy. Every time you create a project FlightPath creates an &lt;code&gt;examples&lt;/code&gt; folder with a directory tree of useful how-to examples. In this case we'll use &lt;code&gt;examples/counting/projects.csv&lt;/code&gt; as the data and create a new csvpath for our work. To create your new csvpaths file just right-click on the &lt;code&gt;counting&lt;/code&gt; directory and select &lt;code&gt;New file&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What we need is to replicate the calculation in the SQL and at the end of the data file output a table. That's the easiest way to make an analog to SQL's &lt;code&gt;GROUP BY&lt;/code&gt;. For this example we prefer to output at the end of the run because CsvPath works line-by-line and we don't care about intermediate results. &lt;/p&gt;

&lt;p&gt;As an aside, a SQL database will likely not have to do a table scan for the &lt;code&gt;GROUP BY&lt;/code&gt;, so the database will have an speed advantage in large data situations. CsvPath is built for automation and typically does somewhat different kinds of processing than most SQL, including things that require the line-by-line approach. In most cases, unless you're working with gigabytes of raw data the real-time performance is fine. Either way, once automated you'll be off doing something else.&lt;/p&gt;

&lt;p&gt;First scaffold your csvpath. FlightPath may have already done this for you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~ test-data:examples/counting/projects.csv ~
$[*][

]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So far, this says scan the whole file (&lt;code&gt;*&lt;/code&gt;) and do nothing. The test file is indicated in &lt;code&gt;test-data&lt;/code&gt;. In production that directive will be ignored, but within FlightPath Data it keeps us from having to select a data file for each test run.&lt;/p&gt;

&lt;p&gt;Next let's do the simplest thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~ test-data:examples/counting/projects.csv ~
$[*][
    subtotal.worker_hours(#agency, #13)
    last() -&amp;gt; var_table("worker_hours")
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These two lines say that we want to subtotal the values in header &lt;code&gt;#13&lt;/code&gt; (0-based) for each different value of the &lt;code&gt;#agency&lt;/code&gt; header. Header &lt;code&gt;#13&lt;/code&gt; is a.k.a. &lt;code&gt;#worker_hours_this_period&lt;/code&gt;. Sometimes you need to use a header's index number for a practical reason. In this case &lt;code&gt;#13&lt;/code&gt; is just less typing. We're giving the subtotal variable a name that is meaningful to us. That could help if we were going to use &lt;code&gt;subtotal()&lt;/code&gt; more than once in a csvpath.&lt;/p&gt;

&lt;p&gt;Then we have the &lt;code&gt;last()&lt;/code&gt; function. &lt;code&gt;last()&lt;/code&gt; matches only the very last line in the data file. When &lt;code&gt;last()&lt;/code&gt; matches we are going to do whatever is on the right-hand side of the &lt;code&gt;-&amp;gt;&lt;/code&gt;. That symbol, &lt;code&gt;-&amp;gt;&lt;/code&gt;, is the when/do operator. It says, when a thing is true, do this other thing. It is CsvPath Validation Language's version of the if/than statement.&lt;/p&gt;

&lt;p&gt;What we want to do is to output a table of worker's hours by agency. We have the data in a variable called &lt;code&gt;@worker_hours&lt;/code&gt; that was created by &lt;code&gt;subtotal()&lt;/code&gt;. Variables are addressed using the &lt;code&gt;@&lt;/code&gt;, just like header values are addressed using the &lt;code&gt;#&lt;/code&gt; sign. &lt;/p&gt;

&lt;p&gt;Many functions generate data in variables, often so you can use it right there within the csvpath. In FlightPath Data you can see your test run's variables in the Variables tab. In production you would see your variables in the run's directory in &lt;code&gt;vars.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So then, on the last line our csvpath will call the &lt;code&gt;var_table()&lt;/code&gt; function, passing it the worker_hours variable by name. This function is part of the print functions group, if you're looking at the help window. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdoluk1rwr1x2vm908r8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdoluk1rwr1x2vm908r8.png" alt="The FlightPath Data help window" width="658" height="1016"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;var_table()&lt;/code&gt; outputs a text table using whatever variable you give it. In our case, a dictionary structure of tracked values. This is the table we see: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4zxk771yimggkzu04a5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4zxk771yimggkzu04a5.png" alt="The validation report from our CSV file using CsvPath" width="800" height="137"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty neat! And simple, only two lines.&lt;/p&gt;

&lt;p&gt;However, our canonical &lt;code&gt;GROUP BY&lt;/code&gt; example at the top used two fields, not just one. Let's do the same here.&lt;/p&gt;

&lt;p&gt;This time we're going to create a key for our subtotal. Our key uses the columns we're interested in and sums the values of the workers' hours accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiiby8d283cm5bom0srt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiiby8d283cm5bom0srt0.png" alt="The full CsvPath example for validating a CSV file" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're adding just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @key = concat(#agency, ", ", #neighborhood)
    subtotal.Aggregate_hours(@key, #13)
    last() -&amp;gt; var_table("Aggregate_hours")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see there isn't much difference. The key is all we need to distinguish the lines we want to group. There are other ways to achieve this outcome, but this is a simple one. The result looks like this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84m6x9zjuvbtmnlem447.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84m6x9zjuvbtmnlem447.png" alt="The validation report output from CsvPath for a CSV file" width="800" height="771"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This output isn't super helpful for automated production testing. However, the data that goes to &lt;code&gt;vars.json&lt;/code&gt; would be. Nevertheless, this is a useful validation in another sense. Often times data managers are looking for a "report-form" validation. I.e. they want written complaints about the data, if there are any problems. The ISO Schematron standard for XML validation is an example of this approach. It makes it easy to associate XPath statements with human readable user-defined errors.&lt;/p&gt;

&lt;p&gt;In the case of CsvPath, as I said above, our focus, first and foremost, is on lights-out automation. To that end, we offer the &lt;code&gt;errors.json&lt;/code&gt; file to collect built-in validation errors and the &lt;code&gt;error()&lt;/code&gt; function for csvpath writers to create their own custom errors feeding into &lt;code&gt;errors.json&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;However, CsvPath also has a sophisticated printouts subsystem that allows you to print to multiple printstreams from within your csvpath. That let's you be informational, display errors, and separate different run information for different purposes and/or audiences that may have different validation needs. There's lots to dig into in that vein. &lt;/p&gt;

&lt;p&gt;For now, though, we have what we came for: a shift-left printout validation upstream of the database doing the same thing as &lt;code&gt;GROUP BY&lt;/code&gt; but closer to the source of data errors where fixes cost less time, money, and hair-pulling. Not bad at all.&lt;/p&gt;

&lt;p&gt;If you want to learn more about CsvPath Framework and tabular data automation check out &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;https://www.csvpath.org&lt;/a&gt;. For a closer look at the validation language I'd suggest grabbing a (free!) copy of &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt; and looking at the examples. The &lt;a href="https://github.com/csvpath/csvpath" rel="noopener noreferrer"&gt;Github repo&lt;/a&gt; is also helpful. &lt;/p&gt;

</description>
      <category>sql</category>
      <category>dataengineering</category>
      <category>csv</category>
      <category>csvpath</category>
    </item>
    <item>
      <title>Comparing Great Expectations and CsvPath Framework</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Sun, 30 Nov 2025 06:46:55 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/comparing-great-expectations-and-csvpath-framework-3lfg</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/comparing-great-expectations-and-csvpath-framework-3lfg</guid>
      <description>&lt;p&gt;Today we're going to take a swing at translating a Great Expectations Python script into &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt;. The script comes from the GE documentation site. First a bit about the tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://greatexpectations.io/" rel="noopener noreferrer"&gt;Great Expectations&lt;/a&gt; is a data quality tool for live production pipeline data checking. You know, the thing we all know we should be doing but mostly aren't. GE comes as a core expectations library and a paid SaaS service that adds teamwork and visualization. The expectations are essentially data quality rules realized in Python functions. Each function contains the logic of the test and the hooks for it to work within the framework's larger context. &lt;/p&gt;

&lt;p&gt;Like CsvPath Framework, GE brings together data sources, data rules, and context to generate metadata. However, Great Expectations is a quality control toolkit, without any of the data management tooling CsvPath brings. &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; is for preboarding tabular data files as they enter the enterprise. Where GE leans towards the relational database world, CsvPath is fully focused on edge governance of file feeds. Both tools capture metadata and throw off validation events. GE stays largely within its SaaS service. CsvPath instead supports to widely supported OTLP and OpenLineage protocols.&lt;/p&gt;

&lt;p&gt;Other differences include the validation approach. Great Expectations  stays in the world of Python. It does not have a schema language or support one, other than SQL. Most of the lifting is done by the packaged and custom "expectations" functions. CsvPath Framework, on the other hand, is a full architecture for preboarding that has a core competency in data quality and data upgrading. It has a schema language for tabular data, CsvPath Validation Language, that is analogous to XSD, DDL, Schematron, or JSON Schema. CSV and Excel finally have someone taking them seriously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Takes On Schema Validation
&lt;/h2&gt;

&lt;p&gt;The example we're going to look at is from &lt;a href="https://docs.greatexpectations.io/docs/reference/learn/data_quality_use_cases/schema#strict-vs-relaxed-schema-validation" rel="noopener noreferrer"&gt;https://docs.greatexpectations.io/docs/reference/learn/data_quality_use_cases/schema#strict-vs-relaxed-schema-validation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It gives a validation script that does four things: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A "strict" validation of the columns of a single table in an RDBMS&lt;/li&gt;
&lt;li&gt;A "relaxed" validation of the columns of the same table&lt;/li&gt;
&lt;li&gt;One common type check&lt;/li&gt;
&lt;li&gt;Another rule for variation in allowed types in a column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While this is a relational database example, Great Expectations' forte, it is a valid comparison since GE only checks one table and in ways that would apply equally well to an incoming tabular file. CsvPath Framework can use databases to store its metadata, but it doesn't validate database data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup Boilerplate
&lt;/h2&gt;

&lt;p&gt;Great Expectations' script has a lot of boilerplate. As you might expect with the modest list above, the setup code is the largest part. &lt;/p&gt;

&lt;p&gt;In contrast, &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; offers &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt; as a no-setup-required GUI environment and/or &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Server&lt;/a&gt; as a set of very lightweight JSON REST endpoints that minimize setup boilerplate down to webhook-able simplicity. There is also a CLI that makes running a CsvPath script a no-setup thing.&lt;/p&gt;

&lt;p&gt;However, for the comparison I'll give the equivalent Python setup. First the Great Expectations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import great_expectations as gx
import great_expectations.expectations as gxe

context = gx.get_context()
# Create Data Source, Data Asset, and Batch Definition.
# CONNECTION_STRING contains the connection string for the Postgres database.
datasource = context.data_sources.add_postgres(
    "postgres database", connection_string=CONNECTION_STRING
)

data_asset = datasource.add_table_asset(name="data asset", table_name="transfers")

batch_definition = data_asset.add_batch_definition_whole_table("batch definition")

batch = batch_definition.get_batch()

# Create Expectation Suite with strict type and column Expectations. Validate data.
strict_suite = context.suites.add(gx.ExpectationSuite(name="strict checks"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GE's own comments are probably sufficient to explain how the library's pieces are being marshaled prior to the validation run.&lt;/p&gt;

&lt;p&gt;Now CsvPath Framework's analog&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from csvpath import CsvPaths
paths = CsvPaths()
paths.file_manager.add_named_file(name="transfers", path="s3://mybucket/2025-may-transfers.csv")
paths.paths_manager.add_named_paths_from_file(name="transfers", file_path="scripts/transfers.csvpaths")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What we did was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We created an instance of CsvPaths, the class that runs sets of scripts against versions of files&lt;/li&gt;
&lt;li&gt;Next we added a physical file as a new version of a logical file named "transfers". I'm pulling it from s3, but it could be in any of the Framework's storage backends.&lt;/li&gt;
&lt;li&gt;Then we added a set of one or more csvpath statements as a named set of validations in a local file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not a lot of setup. Granted, you can do a lot more when you need to; there are many options. But also remember that even that small amount of Python is optional. FlightPath Data, FlightPath Server, and the CsvPath CLI are here for you.&lt;/p&gt;

&lt;p&gt;That's it, both Great Expectations and CsvPath Framework are now ready to validate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Validations
&lt;/h2&gt;

&lt;p&gt;We'll show two validations, both variations on a theme. Basically we check data order and the type of a column.&lt;/p&gt;

&lt;p&gt;The first validation is a strict check on columns/headers. &lt;em&gt;(I'll go into why CsvPath Framework refers to "headers" rather than "columns" another time; there's a good reason, but for now just go with it.)&lt;/em&gt; They must match a provided list. And the &lt;code&gt;transfer_amount&lt;/code&gt; column/header must have double precision values. We're in the land of Python, so from that perspective we're talking about &lt;code&gt;float&lt;/code&gt; values. &lt;/p&gt;

&lt;h3&gt;
  
  
  GE's Validation
&lt;/h3&gt;

&lt;p&gt;This is Great Expectation's example, so let's look at what they are doing. Here's the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;strict_suite.add_expectation(
    gxe.ExpectTableColumnsToMatchOrderedList(
        column_list=[
            "type",
            "sender_account_number",
            "recipient_fullname",
            "transfer_amount",
            "transfer_date",
        ]
    )
)
strict_suite.add_expectation(
    gxe.ExpectColumnValuesToBeOfType(column="transfer_amount", type_="DOUBLE PRECISION")
)
strict_results = batch.validate(strict_suite)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  CsvPath Framework's Validation
&lt;/h3&gt;

&lt;p&gt;CsvPath does its validation and upgrading work in CsvPath Validation Language. The language is concise and function-specific. Interestingly, because the language is purpose-built it offers multiple ways to attack the problem. Let's do the most exact match to the GE version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][ 
header_names_match("type|sender_account_number|recipient_fullname|transfer_amount|transfer_date")
float(#transfer_amount)
]        
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start with the scanning instruction. In this case we want to check all lines so we just pass &lt;code&gt;*&lt;/code&gt;. Then come the functions in the matching part of the statement.&lt;/p&gt;

&lt;p&gt;These two functions are called match components. They are ANDed together (by default, but if needed we can OR). If both evaluate to True the line being considered is a match. In our validation strategy, lines that match are valid. As you can probably tell, this is virtually an exact match to the GE solution.&lt;/p&gt;

&lt;p&gt;Is it the best way, though? Honestly, it is fine. But personal preference weighs in. I have to give you the option I would take.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][
   line(
      blank(#type),
      blank(#sender_account_number),
      string(#recipient_fullname),
      float(#transfer_amount),
      date(#transfer_date)
   )
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the &lt;code&gt;#&lt;/code&gt; indicates a header, the CSV equivalent of a database column. &lt;/p&gt;

&lt;p&gt;For me, that's a more readable structure. It also gives a bit more type information than we require; however, it's pretty easy to guess string and date for &lt;code&gt;recipient_fullname&lt;/code&gt; and &lt;code&gt;transfer_date&lt;/code&gt;, respectively. I also used the &lt;code&gt;blank()&lt;/code&gt; type to assign the header names in positions where I couldn't guess the data types.&lt;/p&gt;

&lt;p&gt;Next let's go back to Great Expectations. We're going to slightly update the first validation to do a more relaxed version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;relaxed_suite = context.suites.add(gx.ExpectationSuite(name="relaxed checks"))
relaxed_suite.add_expectation(
    gxe.ExpectTableColumnsToMatchSet(
        column_set=[
            "type",
            "sender_account_number",
            "transfer_amount",
            "transfer_date",
        ],
        exact_match=False,
    )
)
relaxed_suite.add_expectation(
    gxe.ExpectColumnValuesToBeInTypeList(
        column="transfer_amount", type_list=["DOUBLE PRECISION", "STRING"]
    )
)
relaxed_results = batch.validate(relaxed_suite)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we're allowing the columns to be any order, but they all must be present and no additional columns added. We're also letting the &lt;code&gt;transfer_amount&lt;/code&gt; column now be either a float or a string. Nothing complicated.&lt;/p&gt;

&lt;p&gt;Let's look at the same in CsvPath.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][ 
header_names_match.nocontrib.m("type|sender_account_number|recipient_fullname|transfer_amount|transfer_date")

sum(@m_present, @m_misordered) == count_headers()

or(
 float(#transfer_amount),
 string(#transfer_amount)
]        
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time the CsvPath statement is a bit more verbose. Here's what's happening.&lt;/p&gt;

&lt;p&gt;We use the &lt;code&gt;header_names_match()&lt;/code&gt; function again. We don't care if there isn't a strictly ordered match so we add the &lt;code&gt;nocontrib&lt;/code&gt; qualifier. Qualifiers modify the behavior of match components. In this case we're telling &lt;code&gt;header_names_match()&lt;/code&gt; to not contribute to the determination of if a line matches. We also add an &lt;code&gt;m&lt;/code&gt; just to give a simpler name to the backing variables the function creates.&lt;/p&gt;

&lt;p&gt;We use those backing variables, specifically &lt;code&gt;@m_present&lt;/code&gt; and &lt;code&gt;@m_misordered&lt;/code&gt;, to check that we have all the headers and no additional headers. &lt;code&gt;header_names_match()&lt;/code&gt; also creates a count of unmatched headers and duplicated headers, but we don't need those. &lt;/p&gt;

&lt;p&gt;Finally, we change the type declaration from &lt;code&gt;float()&lt;/code&gt; to a logical structure that accepts either a float or a string. This doesn't work in the &lt;code&gt;line()&lt;/code&gt; schema form, but it works great as a validation rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results and Metadata
&lt;/h2&gt;

&lt;p&gt;The last thing we want to check is... how did we do? Is our data valid?&lt;/p&gt;

&lt;p&gt;Great Expectations does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(f"Strict validation passes: {strict_results['success']}")
print(f"Relaxed validation passes: {relaxed_results['success']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That works fine for purposes of example. &lt;/p&gt;

&lt;p&gt;On the CsvPath Framework side, we have more choices to make. Generally we don't go with something like what the GE example shows, not when we're using the Framework to its fullest. &lt;/p&gt;

&lt;p&gt;Let me back-track and say that we could have done a much simpler CsvPath run like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from csvpath import CsvPath
path = CsvPath()
path.fast_forward(f"""
${transfer.csv}[*][
   line(
      blank(#type),
      blank(#sender_account_number),
      string(#recipient_fullname),
      float(#transfer_amount),
      date(#transfer_date)
   )
]
""")
print(f"Any errors? {path.has_errors}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's everything you need to match the GE example. &lt;/p&gt;

&lt;p&gt;What I actually setup, though, was a more robust preboarding automation-friendly harness. It was similar to what you'd use in production, and to what FlightPath Server does behind the scenes. In that world you work with &lt;code&gt;Result&lt;/code&gt; objects. Results give much more metadata than you get from Great Expectations; often times (hopefully!) more than you need. &lt;/p&gt;

&lt;p&gt;Going back to the original way we setup CsvPath, accessing results to check validity and errors looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from csvpath import CsvPaths

paths = CsvPaths()
paths.file_manager.add_named_file(name="transfers", path="s3://mybucket/2025-may-transfers.csv")
paths.paths_manager.add_named_paths_from_file(name="transfers", file_path="scripts/transfers.csvpaths")
ref = paths.fast_forward_paths(filename="transfers", pathsname="transfers")

results = paths.results_manager.get_named_results(ref)
for result in results:
    print(f"Csvpath has errors: {result.has_errors}, is valid: {result.is_valid}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Right off the bat you're probably saying why. &lt;/p&gt;

&lt;p&gt;Why are we iterating results? We iterate because we can execute multiple csvpath statements at a time. To do that we would load multiple csvpaths statements under the name "transfers". We didn't go into how to do it. Suffice to say, the easiest way is to just put the statements in the same file separated by &lt;code&gt;---- CSVPATH ----&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;And why to we check for errors and validity? Because validation errors is just one possible mark of invalidity. CsvPath Framework considers itself a data preboarding system for files in general, but it takes flat-file validation super seriously, mainly because no one else does. Using CsvPath Validation Language simply is easy, as I hope we have shown. You can, of course, go much further to use it in sophisticated ways that are miles beyond the scope of this post.&lt;/p&gt;

&lt;p&gt;Errors and the &lt;code&gt;is_valid&lt;/code&gt; result overlap but are not identical. To "fail" a file you can call &lt;code&gt;fail()&lt;/code&gt; which makes &lt;code&gt;is_valid&lt;/code&gt; equal &lt;code&gt;False&lt;/code&gt;. Validation errors can also automatically set &lt;code&gt;is_valid&lt;/code&gt; to &lt;code&gt;False&lt;/code&gt;, but that is a configuration choice, not the default. In some cases you might want to instead match on incorrect lines and return them. If you did that your file might be considered invalid if more than 0 lines were returned. In some cases, you might want to take a more Schematron-like approach and simply print built-in and custom error messages as a kind of validation report, rather than relying on a single boolean. There are many options. We're just scratching the surface. Whatever you're trying to do, CsvPath Framework has it covered.&lt;/p&gt;

&lt;p&gt;All that said, we don't need to over-complicate things. We can keep this simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  Net, Net, We Have Validated
&lt;/h2&gt;

&lt;p&gt;Ultimately both Great Expectations and CsvPath Framework validated the data with ease. GE having the advantage with the RDBMS, of course. And CsvPath Framework providing the muscle on the data preboarding side of things. I hope this post convinces you that both tools are worth a closer look!&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>csv</category>
      <category>database</category>
    </item>
    <item>
      <title>Comparing CsvPath and SodaCL</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Fri, 28 Nov 2025 22:10:28 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/comparison-csvpath-vis-a-vis-sodacl-2669</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/comparison-csvpath-vis-a-vis-sodacl-2669</guid>
      <description>&lt;p&gt;Let's have some more fun with comparing and contrasting schema languages. In this post we'll look a schemas + rules-based validation tool, Soda, vis-a-vis &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt;'s &lt;a href="https://github.com/csvpath/csvpath" rel="noopener noreferrer"&gt;CsvPath Validation Language&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;SodaCL is the validation rules language for the &lt;a href="https://www.soda.io" rel="noopener noreferrer"&gt;Soda&lt;/a&gt; data quality library. You can learn more at soda.io. I'll say right up front that this is an apples-to-oranges comparison. Here's why: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Soda is mainly relational data focused; CsvPath Framework is mainly files focused&lt;/li&gt;
&lt;li&gt;Soda is a data quality tool; whereas, CsvPath Framework is for data preboarding, which includes data quality, but isn't limited to it&lt;/li&gt;
&lt;li&gt;SodaCL is a domain-specific language built on YAML that uses embedded SQL; CsvPath Validation Language is a first-class stand-alone validation language, more similar in that regard to DDL or XSD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nevertheless, both tools do flat-file validation, so it is an apt comparison point. As much so as apples and oranges, both being tasty fruit.&lt;/p&gt;

&lt;p&gt;Let's grab a first example from the SodaCL docs and see where it takes us. Please note that these are quick and dirty comparisons. I'm not building the SodaCL or CsvPath for perfection, just giving a rough sense of the differences and similarities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Duplicates
&lt;/h2&gt;

&lt;p&gt;Here's a duplicate rows query check in SodaCL. Even though it is a SQL check it only checks one table so it seems fair game for comparison to a tabular data file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;checks for dim_product:
  - failed rows:
      fail query: |
        with duplicated_records as (
          select
            {{ column_a }},
            {{ column_b }}
          from {{ table }}
          group by {{ column_a }}, {{ column_b }}
          having count(*) &amp;gt; 1
        )
        select
          q.*
        from {{ table }} q
        join duplicated_records dup
          on q.{{ column_a }} = dup.{{ column_a }}
          and q.{{ column_b }} = dup.{{ column_b }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test finds and returns rows that have a common column A + column B. In other words, column A with column B act as a meaningful identity, and if we find a duplicate we found an error. As we're using a SELECT the result is the set of every row that has a  duplicate row.&lt;/p&gt;

&lt;p&gt;In CsvPath we would prefer to do something a bit simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][
    has_dups(#a, #b)
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This does almost the exact same thing. The result is the duplicate lines, but not the original lines. An original line is the first line of a set of duplicates. &lt;/p&gt;

&lt;p&gt;If we want to know all lines with duplicates, regardless of if they are the original line or not, we can use &lt;code&gt;dup_lines()&lt;/code&gt;. This function returns all the line numbers that are duplicated, including the first. That would net us a variable named &lt;code&gt;@dup_lines&lt;/code&gt; (or whatever we want it to be named). &lt;/p&gt;

&lt;p&gt;The variable would contain a key for every unique value holding a list of line numbers. In order to get the actual lines we would need a second CsvPath that uses the duplicate lines variable to return the all the duplicate lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][
   dup_lines.lines(#a, #b)
   no()
]
---- CSVPATH ----
$[*][
   @s = get(
         $dups.variables.lines,
         fingerprint(#a, #b)
        )
   @t = size(@s)
   above(@t, 1)
   append("line", line_number(), yes())
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the first csvpath creates a variable, &lt;code&gt;@lines&lt;/code&gt;, that has all the unique a+b header value fingerprints as keys to stacks of line numbers where that fingerprint was found. The &lt;code&gt;no()&lt;/code&gt; keeps lines from being collected, since we don't need them.&lt;/p&gt;

&lt;p&gt;We load these two csvpaths as a single named-paths group called &lt;code&gt;dups&lt;/code&gt;. Running the named-paths group serially makes sure the first csvpath has prepared the data that the second needs before the second starts.&lt;/p&gt;

&lt;p&gt;If the &lt;code&gt;@lines&lt;/code&gt; gives a stack for any line with a count above 1 the line has duplicates. Because &lt;code&gt;above()&lt;/code&gt; returns true the line matches and is collected. QED.&lt;/p&gt;

&lt;p&gt;In our example there is no ID that distinguishes lines. In a real case, you might want to have the line numbers so you can better investigate why there were duplicates. As you can see, the  &lt;code&gt;dup_lines()&lt;/code&gt; function captures that for you in a variable. The variable is available programmatically and in the &lt;code&gt;vars.json&lt;/code&gt; file generated by the run. &lt;/p&gt;

&lt;p&gt;However, to stay closer to our working csvpath, we can just add the line number to the lines captured. To do that, we append a new header &lt;code&gt;line&lt;/code&gt;, giving it the value from &lt;code&gt;line_number()&lt;/code&gt;. The &lt;code&gt;yes()&lt;/code&gt; says that we want to include &lt;code&gt;line&lt;/code&gt; as in the header line.&lt;/p&gt;

&lt;p&gt;To strip the CsvPath solution back to essentials that match the SodaCL we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$[*][
   dup_lines.lines(#a, #b)
]
---- CSVPATH ----
$[*][
   @s = get(
         $dups.variables.lines,
         fingerprint(#a, #b)
        )
   above(size(@s), 1)
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a nice concise pair of csvpaths. If we were preboarding data with more rules, we would add these two statements to a larger named-paths group covering all the validation rules. &lt;/p&gt;

&lt;p&gt;You can see that for data preboarding, &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt;'s purpose-built CSV validation capabilities are on target. SodaCL, while not a preboarding tool, is also highly effective and obviously a better choice for monitoring the data quality of database-housed data downstream of CsvPath. There's more we can compare between CsvPath and SodaCL. We'll return to it in a future post.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>csv</category>
    </item>
    <item>
      <title>Let's say you have a data lake</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Thu, 27 Nov 2025 16:06:11 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/how-to-say-you-have-a-data-lake-1p4l</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/how-to-say-you-have-a-data-lake-1p4l</guid>
      <description>&lt;p&gt;Say you have a data lake. How do you get &lt;em&gt;known-good&lt;/em&gt; CSV or Excel files into it? &lt;/p&gt;

&lt;h2&gt;
  
  
  What About Transfer Mode?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; has a few options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locate the archive on the local disk, within the lake&lt;/li&gt;
&lt;li&gt;Use a backend to stream to the lake during processing&lt;/li&gt;
&lt;li&gt;Use the SFTP integration to forward files after processing is done&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;transfer-mode&lt;/code&gt; to forward files after processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All good options. Each serves particular use cases, though with quite a bit of overlap.&lt;/p&gt;

&lt;p&gt;Local processing is of course the fastest, so making CsvPath Framework essentially be the bronze layer of a local data lake is common. In this scenario, the archive(s) are the part of the lake that handle data files.&lt;/p&gt;

&lt;p&gt;Alternative, any remote backend (&lt;code&gt;s3://&lt;/code&gt;, &lt;code&gt;azure://&lt;/code&gt;, &lt;code&gt;gs://&lt;/code&gt;, &lt;code&gt;sftp://&lt;/code&gt;) can receive data as it is generated. That is a notably slower option because of network latency. This is not a CsvPath Framework issue; every tool has it. In many case, however, speed is either not an issue because the data size is moderate (MB, not GB) or because some other concern overrides write time. And, of course, lights-out automation makes slower easier to take.&lt;/p&gt;

&lt;p&gt;There is also an SFTP integration that allows you to forward files to an SFTP server. Why is that different from using the SFTP backend? It is not much. It allows you to populate archive and also copy output to another location. This can be useful for returning files to data partners for their checks without providing access to the archive. There may be other similar use cases.&lt;/p&gt;

&lt;p&gt;This article is about the fourth option &lt;code&gt;transfer-mode&lt;/code&gt;.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Speed With Flexibility
&lt;/h2&gt;

&lt;p&gt;Transfer mode is a &lt;a href="https://www.csvpath.org/topics/the-modes" rel="noopener noreferrer"&gt;Framework-native mode&lt;/a&gt; that allows you to transfer data to any backend. (Framework-native basically means it is always available; there's no turning it off like an integration; practically speaking, this means it is not under the control of an admin or project lead; any csvpath writer can use it). &lt;/p&gt;

&lt;p&gt;How is that different from just using a backend? It's different because &lt;code&gt;transfer-mode&lt;/code&gt; is a post processing step, not a streaming-during-processing step. That means it is potentially much faster and can go somewhere outside the archive. The location doesn't need to be a &lt;code&gt;data.csv&lt;/code&gt; or &lt;code&gt;unmatched.csv&lt;/code&gt;; you can name the file anything you like.&lt;/p&gt;

&lt;p&gt;In addition, transfer-mode permits something very un-CsvPath Framework: it allows you to merge data. CsvPath permits a lot of things in validation and upgrading. It does not, generally, permit any ETL-like stuff. And for the most part, ETL tools are not great at validation (though they tend to be Ok in upgrading scenarios). We pick our battles. And we solve for preboarding, which ETL tools to a good approximation never do.&lt;/p&gt;

&lt;p&gt;But with transfer mode you can pack two data files into one. And this can be incredibly helpful in the case of you needing a quick and dirty join and you don't want to unpack the ETL to do it. &lt;/p&gt;

&lt;p&gt;When might you want that? A good example might be running a named-paths group in serial with simple, single-rule csvpaths that need to aggregate data. Basically an OR across csvpaths, allowing for rules that are easier to test. There are, of course, other possible scenarios. &lt;/p&gt;

&lt;h2&gt;
  
  
  How Does Transfer Mode Work?
&lt;/h2&gt;

&lt;p&gt;Like this: you add the transfer mode metadata tag to an external comment and a variable indicating the path to transfer to. It might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~
    id:output to data lake
    transfer-mode: data &amp;gt; out+
~
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the ID is &lt;code&gt;output to data lake&lt;/code&gt;. That is just useful documentation and helpful in validation print statements. Transfer mode is directing &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; to push the &lt;code&gt;data.csv&lt;/code&gt; output to the location specified by the &lt;code&gt;@out&lt;/code&gt; variable. The trailing &lt;code&gt;+&lt;/code&gt; indicates the &lt;code&gt;data.csv&lt;/code&gt; should be appended to any already existing file at that location. If we wanted to send data to another place we would just add another data-angle-variable statement after the one you see, separated by a comma.&lt;/p&gt;

&lt;p&gt;It's that simple. Here's a version of the same with the body of a csvpath attached.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~
    id: boston freds
    transfer-mode: data &amp;gt; lake+
    source-mode: preceding
~
$[*][
    line(
      string(#firstname,25,1),
      string(#city,35,5),
      integer.notnone(#zip,5),
    )

    lower(#firstname) == "fred"
    lower(#city) == "boston"

    @lake = "s3://all-freds-in/cities.csv"
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>csvpath</category>
      <category>csv</category>
    </item>
    <item>
      <title>A Stranger In a New Town: CsvPath metadata fields</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Tue, 25 Nov 2025 02:27:34 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/a-stranger-in-a-new-town-csvpath-metadata-fields-1non</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/a-stranger-in-a-new-town-csvpath-metadata-fields-1non</guid>
      <description>&lt;p&gt;&lt;em&gt;The horse half died by the time we came off the highland. That's why I don't name my horses. My boots had holes. My six was dry. Nothing in my pockets but metadata. What's a guy gotta do to get a drink in this town?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Metadata is the wild west. &lt;/p&gt;

&lt;p&gt;A great example that goes beyond data is tags. A lot of you have probably noticed in AWS that tags are amazing. Amazingly hard to use well for any sizable project, let alone the enterprise. Same with the other cloud providers. &lt;/p&gt;

&lt;p&gt;Metadata is supposed to drive everything, ideally from metadata catalogs. But just you try figuring out how to capture consistent metadata across all your systems automatically so that they are up to date and consistent. Then you sit back and think, ok, I got that, so now what should I capture and how should I use it and how do I know I can trust it. &lt;/p&gt;

&lt;p&gt;Sorry partner, I'm not here for that. &lt;/p&gt;

&lt;p&gt;I'm laying out that prologue just to highlight that it's a deep subject and ornery to operationalize. You may not have gotten a lead line on &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt;'s small contribution yet. Let's take a look. It's a narrow vista of a much larger landscape, but manageable and super useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Metadata
&lt;/h2&gt;

&lt;p&gt;There are three types of metadata in CsvPath Framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Framework generated&lt;/li&gt;
&lt;li&gt;User configured&lt;/li&gt;
&lt;li&gt;User defined&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Framework generated metadata is everything collected in the act of: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staging data in named-files&lt;/li&gt;
&lt;li&gt;Loading csvpath statements in named-path groups&lt;/li&gt;
&lt;li&gt;Running named-files against named-paths groups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these activities results in metadata that is minimally captured to JSON files. Most of it can also be sent to a relational database and/or observability platform. The JSON files are  &lt;strong&gt;manifest.json&lt;/strong&gt; for Framework mechanics and &lt;strong&gt;meta.json&lt;/strong&gt; for runtime generated metadata. It's the latter that I'm going to focus on here. &lt;/p&gt;

&lt;p&gt;User-configured metadata and user-defined metadata come from the CsvPaths themselves. They are located in "external comments". An external comment is one that is above (or less commonly below) the body of a csvpath statement. In files with multiple csvpaths separated by &lt;code&gt;---- CSVPATH ----&lt;/code&gt; external comments live between the csvpaths.&lt;/p&gt;

&lt;p&gt;User-configured metadata are, primarily, the modes, along with some integrations-specific fields known only by the integration that allows them. A mode is one of 11 settings that can be applied on a csvpath-by-csvpath basis. Modes do things like: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determine how validation is handled&lt;/li&gt;
&lt;li&gt;Set the logical operator used to combine match components&lt;/li&gt;
&lt;li&gt;Switch on collecting of unmatched lines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And a bunch of other useful things. Modes are built into the Framework. They are helpful in understanding why the results you get are what you got.&lt;/p&gt;

&lt;p&gt;User-defined metadata are fields that you, the csvpath writer, create to document your data, and potentially to trigger behavior in other systems. A user-defined field looks like a word with a colon after it. This is a metadata field: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;description: this csvpath validates order files&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The meaning is pretty clear, we're creating a description field and setting its value to &lt;code&gt;this csvpath validates order files&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CsvPath Framework Tags
&lt;/h2&gt;

&lt;p&gt;The Framework doesn't use the word "tag" today. Neither does &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt;. That's one reason for me to drop this post. Tags are super helpful, but since we call them user-defined metadata fields, a lot of words, and then don't talk about them much, they are probably under-used.&lt;/p&gt;

&lt;p&gt;So I'll just call them tags.&lt;/p&gt;

&lt;p&gt;When you create a tag, it is free text. Any one word followed by a colon creates a tag. The value of the tag runs until the next word-with-a-colon is seen. You can also stop a tag by just a stand-alone colon. That can be useful if you prefer to put your tags above a narrative description of the csvpath.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ~ 
      copyright: © atesta analytics
      author: William Blake
      test-data: examples/schemas/example-one.csv
      : This csvpath shows how metadata is created along side  
      documentation in external comments. It is just a quick example.
    ~
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we created two metadata fields, "copyright" and "author", as well as using a well-known instruction for FlightPath ("test-data") and adding some documentation.&lt;/p&gt;

&lt;p&gt;When we run our csvpath against a CSV we get something like this: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12bjc3sg0tm011tr3zbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12bjc3sg0tm011tr3zbq.png" alt="meta.json is a file capturing metadata and runtime metrics for every validation run of a named-file CSV against a named-paths group" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the two fields we created. The test-data field for FlightPath is there. (Though in the run that created the screenshot we didn't use it.) Print mode was added by the Framework in the background. Any modes we used explicitly would also show here. And you can see the entire original comment for context.&lt;/p&gt;

&lt;p&gt;Simple! And in many ways very similar to AWS or JIRA or any other system that offers tags-based organization. &lt;/p&gt;

&lt;p&gt;Now, what do we do with these metadata field tags? Well, one obvious thing to do is to document our csvpaths and the data they validate and/or upgrade. This means capturing the world of a csvpath's run as: - narrative docs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Framework generated metadata&lt;/li&gt;
&lt;li&gt;User-defined tags&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some things in CsvPath Framework are clear enough at a technical level from metadata you don't have to define yourself. For example, you know what named-file and named-paths group were used in every run. But you don't know who the data belongs to. Even if you have an indication by the named of the named-file or the path within the named-file that gets you the data file bytes you might not know and your downstream system almost certainly has a different viewpoint.&lt;/p&gt;

&lt;p&gt;We can add some more tags: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yt78hfxnz2fcw6nsnqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yt78hfxnz2fcw6nsnqx.png" alt="Adding metadata fields to a csvpath that will be used to validate CSVs" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt; we see the metadata flow into a run's meta.json:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz13jg3zvm9nuduhtg5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz13jg3zvm9nuduhtg5b.png" alt="The meta.json file captures all the user-defined, system-defined, and runtime metadata" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now we have the opportunity to pull that run's metadata from the archive using FlightPath Server's API:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1lraf69n9bavb5o6b02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1lraf69n9bavb5o6b02.png" alt="FlightPath Server's metadata access API gives you the metadata produced during your last CSV or Excel file validation" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How you use this feature is of course up to you. While you can annotate your schemas and rules with inline comments, a good use for user-defined metadata fields is to say more about what each part of a schema means. &lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ~
       User schema for the Wild West Order Management application.

       username: this username is controlled by the SSO
       firstname: a free optional field. middle names can go here if needed.
       family_name: not optional. we expect a single name, possibly hyphenated. we're not the system of record. the name should match SSO lastname.

        validation-mode: print, raise
    ~
    $[*][
       line.user.distinct(
          string.username.nonnone(#0, 35, 8),
          string.firstname(#firstname, 40),
          string.lastname.notnone(#family_name, 55)
       )
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clearly there's a ton of documentation here for a tiny schema, as well as several hard constraints. We can easily find a way to convey basically everything about the data file that you'd want to know, and that's before looking at the Framework-defined metadata and runtime metrics data. That means, if we go all out on the metadata, we have a lot of choices to make. &lt;/p&gt;

&lt;p&gt;The potential for the tag-o-sphere to become a mess is high, of course. The good news is that CsvPath Framework and FlightPath are not intended to be a metadata catalog. They can and should feed a catalog and/or stand ready to serve data based on metadata fields known by other systems. But you don't typically browse the archive as a metadata repository like you might with OpenMetadata, DataHub, or Secoda. &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Framework&lt;/a&gt; is a producer system, not a consumer system. It will tell you what you need to know in great detail, but unlike those other systems, CsvPath doesn't offer you all the things you don't know.&lt;/p&gt;

&lt;p&gt;Again, this is powerful stuff. You can safely ignore user-defined metadata if you choose. But as your data operations expand and mature, you have an awesome opportunity to add a huge amount of clarity for downstream users through producing the right metadata. And it's easy to do. &lt;/p&gt;

&lt;p&gt;Not bad for a one-horse town.&lt;/p&gt;

</description>
      <category>metadata</category>
      <category>dataengineering</category>
      <category>csv</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A real-world example of CsvPath schemas</title>
      <dc:creator>David Kershaw</dc:creator>
      <pubDate>Fri, 14 Nov 2025 01:31:05 +0000</pubDate>
      <link>https://forem.com/david_kershaw_b6916404da6/a-real-world-example-of-csvpath-schemas-1j0p</link>
      <guid>https://forem.com/david_kershaw_b6916404da6/a-real-world-example-of-csvpath-schemas-1j0p</guid>
      <description>&lt;p&gt;Zuora's example invoice upload CSV came to my attention in a timely way. Just last night I posted this &lt;a href="https://blog.csvpath.org/do-schemas-have-a-place-in-delimited-data" rel="noopener noreferrer"&gt;comparison of SQL and CsvPath schemas&lt;/a&gt;. CSV schemas are an interesting topic for anyone who likes to geek-out about document and data validation. (👋 Raises hand...) but are they practical?&lt;/p&gt;

&lt;p&gt;As it turns out, yes. And a good example of where you might use them comes from the Zuora docs: &lt;a href="https://knowledgecenter.zuora.com/Zuora_Platform/Data_Management/Data_Loader/Guidelines_for_CSV_file_upload_in_Data_Loader" rel="noopener noreferrer"&gt;Guidelines_for_CSV_file_upload_in_Data_Loader&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This file format combines invoices with invoice items. That's a very common thing to do. It is about as tasteful as a wildly denormalized SQL table, but practical, for sure. How might we validate an invoice upload file? We would use &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath schemas&lt;/a&gt;. Could we use just plan CsvPath rules? Sure, we could. But as you'll see, &lt;code&gt;line()&lt;/code&gt; schemas are cleaner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1a0u8dugy6aaxcijxy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1a0u8dugy6aaxcijxy2.png" alt="Full window view of the FlightPath Data app" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're playing along at home, pop open FlightPath Data. Those of you who are new, &lt;a href="https://www.flightpathdata.com" rel="noopener noreferrer"&gt;FlightPath Data&lt;/a&gt; is CsvPath Framework's favorite frontend. It is available free on the Windows store and the Apple macOS app store.&lt;/p&gt;

&lt;p&gt;Take a look at the Zuora &lt;a href="https://knowledgecenter.zuora.com/@api/deki/files/35699/CSV_Invoice_-_Sample_CSV_1.csv?revision=1" rel="noopener noreferrer"&gt;example file&lt;/a&gt;. In FlightPath Data it looks like: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4eylmmhsor64r5mnwrmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4eylmmhsor64r5mnwrmx.png" alt="Grid view of a CSV in FlightPath Data" width="800" height="339"&gt;&lt;/a&gt;. It starts with three empty lines. If you switch to plain text it's even clearer. The lines have delimiters but no values. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwf8hzu46m0rfga1ho5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwf8hzu46m0rfga1ho5n.png" alt="Plain text view of the example CSV file" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is our first challenge. Let's say that we can accept the four blank lines without calling the file invalid. How does our CsvPath have to accommodate that? The answer is one of those "easy if you know it" things.&lt;/p&gt;

&lt;p&gt;In this case, we're going to assume that we'll typically have blank lines. The accommodation/fix is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    empty.nocontrib(headers()) -&amp;gt; skip()
    after_blank.nocontrib() -&amp;gt; reset_headers()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These two match components say that if a line is empty we skip it. We put the &lt;code&gt;nocontrib&lt;/code&gt; qualifier there to say that we don't want emptiness to be a validation criteria. In other words, the &lt;code&gt;empty()&lt;/code&gt; doesn't contribute to the match calculation. The &lt;code&gt;skip()&lt;/code&gt; make CsvPath jump to the next line without matching the current line and without checking any of the match components below it.&lt;/p&gt;

&lt;p&gt;Then we have to find the headers. Technically we don't need to do this, because we can access header values using indexes. E.g. &lt;code&gt;#0&lt;/code&gt; is the same as &lt;code&gt;#IsNewInvoice&lt;/code&gt;. However, the names are easier to work with. To find the headers we simply check if the previous line was a blank, and if so, reset the header names to be the values in that line. And again, we don't want the &lt;code&gt;after_blank()&lt;/code&gt; function to determine if a line matches. Easy!&lt;/p&gt;

&lt;p&gt;Now the fun part. We need to define two schema entities using the &lt;code&gt;line()&lt;/code&gt; function. One is for the invoices. The other is for each invoice's line-items. Interestingly, and again commonly for this form of CSV, the invoice lines are cohabitated with the first invoice item. That is no trouble at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The invoice entity
&lt;/h2&gt;

&lt;p&gt;A new invoice is indicated by a &lt;code&gt;TRUE&lt;/code&gt; value in the #0 header. When we see that we need to match:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   #0 == "TRUE" -&amp;gt; line.invoice(
       blank(),
       string(#"Account Id"),
       string(#"Invoice Date"),
       string(#"Auto Pay"),
       string(#comments),
       string(#"Invoice Number"),
       string(#"Invoice.PO Number"),
       wildcard()
   ) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enitity skips the first header, accepts strings for the next six headers, and adds a &lt;code&gt;wildcard()&lt;/code&gt; to allow any number of additional values to exist to the right.&lt;/p&gt;

&lt;p&gt;When you run this, you get two matching lines. With our validation strategy, these two lines are valid. Everything else in the file is skipped or invalid. We're doing well. We have invoices. But we still need invoice items. Or, I should say, we have invoice items but they are incorrectly being found to be invalid. We'll fix that.&lt;/p&gt;

&lt;p&gt;For our second entity, the invoice items, we make another &lt;code&gt;line()&lt;/code&gt;. Keep in mind that a &lt;code&gt;line()&lt;/code&gt; is always a complete description of a single line. But at the same time it is also as specific as you like, and so may not consider all header values. &lt;/p&gt;

&lt;p&gt;The way &lt;code&gt;line()&lt;/code&gt; can model both a whole line and an entity within a line is by using &lt;code&gt;blank()&lt;/code&gt; and &lt;code&gt;wildcard()&lt;/code&gt; to block out the parts of a line that it doesn't care about. We do this in the invoice entity in the first header position because we don't care about the signal &lt;code&gt;IsNewInvoice&lt;/code&gt;. Here, in the invoice item entity, we don't care about the invoice, so we'll skip over that with &lt;code&gt;wildcard()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   #IsNewInvoiceItem == "TRUE" -&amp;gt; line.item(
       wildcard(8),
       string(#"Invoice Item Amount"),
       datetime(#"Invoice Item Service Start Date", "%Y-%m-%d %H:%M:%S"),
       string(#"Invoice Item Charge Date"),
       string(#"Invoice Item Charge Name"),
       string(#"Invoice Item Description"),
       string(#"Invoice Item Quantity"),
       string(#"Invoice Item Service End Date"),
       string(#"Invoice Item Unit Price"),
       string(#"IsNewInvoiceItemTaxItem"),
       string(#"Invoice Item Tax Item Tax Code"),
       string(#"Invoice Item Tax Item Tax Mode"),
       string(#"Invoice Item Tax Date")
   )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are very simple entities without much typing or constraints. In a real invoice validation csvpath statement you would probably use types, constraints, and qualifiers, and add rules outside the &lt;code&gt;line()&lt;/code&gt;. But here we're only roughing out what is possible.&lt;/p&gt;

&lt;p&gt;This looks good! It should work, right?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8401zarbqe2c14klpuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8401zarbqe2c14klpuv.png" alt="Two invoices, no invoice items" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, not exactly. If we run this we will get the two invoices, but not the invoice items. Why is that? &lt;/p&gt;

&lt;p&gt;The reason is we're missing a &lt;code&gt;nocontrib&lt;/code&gt; qualifier. We are matching the two invoices just fine. But when we don't match on an invoice we are considering the line to be invalid. That means that we only get one invoice item per invoice -- the one that is on the same line as the parent invoice. But clearly that's not what we want.&lt;/p&gt;

&lt;p&gt;Since we always have a first invoice item on an invoice's line, all we need to do is declare that there being an invoice, or not, doesn't determine if a line matches. I.e., the invoice entity doesn't contribute to the match. And we already saw how to do that with the &lt;code&gt;nocontrib&lt;/code&gt; qualifier. We just add another to the invoice test on the first column like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    #0.nocontrib == "TRUE" -&amp;gt; line.invoice(
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when you run the csvpath, you get all the invoices and all their invoice items in the matched lines collected. You have validated your file, and as a bonus, removed the extraneous empty lines at the top.&lt;/p&gt;

&lt;p&gt;Easy, simple, and if you're a validation geek like me, really cool looking!  Have fun with this &lt;a href="https://www.csvpath.org" rel="noopener noreferrer"&gt;CsvPath Validation Language&lt;/a&gt; schema stuff. If you get stuck, leave a comment and I'll be happy to help out.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>csv</category>
      <category>csvpath</category>
    </item>
  </channel>
</rss>
