Forem: David Kershaw

JSONPath Is In! The AI Assistant Will See You Now

David Kershaw — Thu, 29 Jan 2026 01:22:12 +0000

JSONL is a neat and kind of weird data format. It is well-known to be useful for logs and API calls, among other things. And your favorite AI assistant API is one place you'll probably find it.

CsvPath Framework supports validating JSONL. (In fact, it supports JSONL for the whole data preboarding lifecycle, but that's a longer story.)

And now CsvPath Validation Language supports JSONPath expressions. Since AI prompts are only kinda sorta JSONL, having JSONPath to dig into them is helpful.

What I mean by kinda-sorta is that your basic prompt sequence is a series of JSON lines, but the lines are all 1-column wide and arbitrarily deep. That sounds more like a series of JSON "documents" than it does like a single JSONL stream. Or, anyway, that's my take.

Let's look at how to use JSONPath to inspect a JSONL file using CsvPath Validation Language in FlightPath Data. For those of you who don't already know, FlightPath Data is the dev and ops frontend to CsvPath Framework. It is a free and open source download from the Apple MacOS Store or the Microsoft Store.

The file is a common example prompt. We'll start by looking at one line.

Here's the last line:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a happy assistant that puts a positive spin on everything."
    },
    {
      "role": "user",
      "content": "I'm hungry."
    },
    {
      "role": "assistant",
      "content": "Eat a banana!"
    }
  ]
}

From CsvPath Framework's perspective this is a one-header document. The one header is messages. If you open this in the grid view you see only the one header (i.e., one column; with delimited files we try to stick to the word "header" because "column" nudges your RDBMS-soaked brain toward incorrect assumptions).
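To make the one-header idea concrete, here's a minimal sketch in plain Python (standard library only, not CsvPath Framework itself). It parses the line above and lists its top-level keys; the variable names are just for illustration.

```python
import json

# The last line of the example prompt file, written as one JSONL line.
line = (
    '{"messages": ['
    '{"role": "system", "content": "You are a happy assistant that puts a positive spin on everything."}, '
    '{"role": "user", "content": "I\'m hungry."}, '
    '{"role": "assistant", "content": "Eat a banana!"}]}'
)

record = json.loads(line)

# The top-level keys play the part of headers: there is exactly one.
headers = list(record.keys())

# Everything interesting is nested under that single header.
nested_value = record["messages"]

print(headers)
print(type(nested_value).__name__, len(nested_value))
```

One key at the top, three objects underneath it. That is the whole "1-column wide and arbitrarily deep" shape in miniature.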

Here's what it looks like:

That's not super fun. The reason is that:

JSONL doesn't present its headers in the grid view (for good reasons)
The messages header is arbitrarily deeply nested, unlike the typical JSONL log line

Nevertheless, that's what we have. Will it blend? I mean validate? Yes. JSONPath to the rescue. That said, I'll pause to admit that I'm not a JSONPath expert.

Right-click in the project files tree on the left of FlightPath and create a new .csvpath file, e.g. messages.csvpath. Drop this simple example in it.

$[*][ 
    push("roles", jsonpath(#messages, "$[?(@.role == 'assistant')].content") )

    last() -> print("See the variables tab for results")
]

You can see the jsonpath() function. It is acting on the messages header, as we'd want. We're pushing the data pulled by the JSONPath expression into a stack variable named roles.

A stack variable is like a Python list or tuple. You create a variable by using it. roles is part of the set of zero or more variables that are available throughout the csvpath statement run. They are captured to the Variables tab for a one-off FlightPath Data test run. For a production run they end up in the vars.json file in the run results.

Put your cursor in the csvpath statement and click cmd-r (or ctrl-r on Windows). The output tabs should open at the bottom-middle of the screen, below your csvpath file. Click to the Variables tab and have a look.

Our JSONPath:

$[?(@.role == 'assistant')].content

picked out the objects in the messages array where role equaled assistant. And from those objects it extracted and returned the value of the content key.

Pretty simple stuff. Tho, I have to admit it took me a few minutes to wrap my JSONPath-inexperienced head around the context for the JSONPath expression. I was thinking of the whole document or the whole line, but that wasn't right.

It is obviously the JSON value assigned to the messages key, which is an array, in this case. Once I was operating from that correct context, the JSONPath became pretty straightforward. (Those of us with XPath scars need not be as afraid as we might be!)
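If the context question tripped you up too, this plain-Python stand-in may help. It is only a sketch of what the filter does, assuming the root of the expression is the value of the messages header; it does not use a JSONPath library or CsvPath Framework.

```python
import json

line = (
    '{"messages": ['
    '{"role": "system", "content": "You are a happy assistant that puts a positive spin on everything."}, '
    '{"role": "user", "content": "I\'m hungry."}, '
    '{"role": "assistant", "content": "Eat a banana!"}]}'
)

record = json.loads(line)

# The JSONPath root ($) is the value of the messages header, not the whole line.
root = record["messages"]

# Plain-Python stand-in for: $[?(@.role == 'assistant')].content
assistant_content = [
    item["content"]
    for item in root
    if isinstance(item, dict) and item.get("role") == "assistant" and "content" in item
]

print(assistant_content)
```

Starting from record instead of record["messages"] is exactly the mistake I was making in my head.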

The point here is two-part. First, we can deal with AI prompts or any other JSONL that is deeply nested. Hooray! The data may look odd, if you are comparing it to regular tabular data, but that's no reason to not validate.

Second, this example makes the point that we're doing JSONPath rules-based validation within our CsvPath context. How very Schematron-like, since Schematron does XPath validation within XSLT.

Maybe this sounds complicated, but really it's not. CsvPath Validation Language is great for all things line-oriented. In this case, there isn't much for it to do, except hand off to JSONPath, which is great at documents (a.k.a. objects). Simple enough.

If we wanted to create a bunch of JSONPath rules to validate our AI prompt JSONL, we could do that. To just do a quick throw-away second rule as an example try this:

$[*][ 
    push("roles", jsonpath(#messages, "$[?(@.role == 'assistant')].content") )

    @stmts = jsonpath(#0, "$.`len`")
    print("Line $.csvpath.line_number: $.variables.stmts")
    @stmts == 3 

    last.nocontrib() -> print("See the variables tab for results")
]

That new rule will net you 2 lines, which are either valid or failed, depending on how you want to use your csvpath statement. You will see them in the Matches tab.

At the same time the expanded csvpath statement will continue to pull in the same data to the variables tab that we got with the first version of the csvpath.
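For comparison, here's roughly what the two rules do, sketched in plain Python. The two-line data set is hypothetical (it is not the file used above), and the sketch assumes the pushed values are simply accumulated across lines.

```python
import json

# Hypothetical stand-in data: two JSONL lines, each with a single messages header.
jsonl = "\n".join([
    json.dumps({"messages": [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "I'm hungry."},
        {"role": "assistant", "content": "Eat a banana!"},
    ]}),
    json.dumps({"messages": [
        {"role": "user", "content": "I'm tired."},
        {"role": "assistant", "content": "Take a nap!"},
    ]}),
])

roles = []    # plays the part of the "roles" stack variable
matches = []  # line numbers that satisfy the length rule

for line_number, line in enumerate(jsonl.splitlines()):
    messages = json.loads(line)["messages"]
    # Rule 1 (side effect): collect the assistant content from every line.
    roles.extend(m["content"] for m in messages if m.get("role") == "assistant")
    # Rule 2 (match decider): the messages array must hold exactly 3 items.
    if len(messages) == 3:
        matches.append(line_number)

print(roles)
print(matches)
```

The first rule only gathers data; the second is the one that decides which lines match.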

To clean it up just a little, you can do:

$[*][ 
    push("roles", jsonpath(#messages, "$[?(@.role == 'assistant')].content") )

    jsonpath(#0, "$.`len`") == 3
]

There you go, a valid 2-rule validation statement using JSONPath on nested JSON in a JSONL document. Useful? Totally! Give it a try.

Custom Functions FTW

David Kershaw — Wed, 28 Jan 2026 22:20:50 +0000

CsvPath Validation Language is functions-based. It applies a very simple syntax and a large number of functions to validate CSV, JSONL, and Excel files in ways that were never before possible.

And then comes the moment when you want to do some crazy thing that the CsvPath Framework contributors didn't think of. What to do? You create a custom function that does exactly that.

Custom functions even work in FlightPath Data and FlightPath Server. I call that out because FlightPath Data is a multi-project environment. And FlightPath Server is both multi-project and also multi-user. That means functions must be scoped and sandboxed. They are and they work great!

Let's create a trivial example to show the setup of a custom function. I'll leave the actual functionality as an exercise for the reader, since that part is demonstrated copiously in the CsvPath Framework GitHub repo.

The goal

Our goal is to create a function called sure(). It will functionally be the same as yes(). I.e. it returns True.

Our csvpath statement will be:

    $[*][ sure() ]

If you try that in FlightPath Data you will get this error message:

The config.ini

The first step is to point to a functions import file. By default, import files are called functions.imports and live in the project's config directory. In FlightPath Data, click Open config at the bottom left of the app to open the config panel. Then click functions in the vertical tabs to open the functions config form. The form has just one field for the path to the imports file. The path can be relative or absolute.

Once the path is ready click Save and reload and then Close config.

Next we need to edit the imports file to include our sure() function. Right click on the blank space in the project files tree and select Open project directory.

Open the config directory and you should see three files, config.ini, env.json, and functions.imports. If you don't see all three don't worry about it; some files are generated just in time. If there is no functions.imports create one. Then open it.

In functions.imports we're going to add one line that imports our sure() function.

This is basically the same form as Python's. It says find the example/one/yes.py file and import the Yes class, using the name sure as the function name of the imported class. Simple!

Finally we just need to put the custom function in the right place. The right place, starting from the project's home directory, is config/example/one/yes.py.
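If you like to double-check file locations before restarting anything, a small script can do it. This is a sketch under assumptions: the helper names are made up, it only knows about the two files this walkthrough mentions, and it runs against a throwaway directory standing in for a project home.

```python
from pathlib import Path
import tempfile

def custom_function_paths(project_home: Path) -> dict:
    """Return the two files this walkthrough expects, keyed by a short label."""
    config_dir = project_home / "config"
    return {
        "imports": config_dir / "functions.imports",
        "function": config_dir / "example" / "one" / "yes.py",
    }

def missing_files(project_home: Path) -> list:
    """List the labels of expected files that do not exist yet."""
    paths = custom_function_paths(project_home)
    return [label for label, path in paths.items() if not path.is_file()]

# Demo against a temporary directory standing in for a project home.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    before = missing_files(home)
    paths = custom_function_paths(home)
    paths["function"].parent.mkdir(parents=True)
    paths["function"].write_text("# the copied Yes class would go here\n")
    paths["imports"].write_text("")  # the import line itself is left out on purpose
    after = missing_files(home)

print(before, after)
```

Point missing_files() at your real project directory and an empty list means both files are where this post says they should be.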

I copied the regular Yes class from its file in the repo to make my example yes.py. Again, we're just setting up a custom function, not showing how to write an awesome function.

This is where the yes.py file lives.

And... we're basically ready to go. However, if you ran a csvpath already, restart FlightPath to clear the function classes that were already loaded. You can do this programmatically in CsvPath Framework, but there's no button in FlightPath's config yet.

That done, back in FlightPath Data right click in your project files tree and create a new file. Call it test.csvpaths, or whatever you like.

In it, paste our test csvpath:

    $[*][ sure() ]

Make it look like:

Now, with your cursor inside the csvpath statement, click cmd-r (or ctrl-r on Windows).

You should see the message Test run complete. Matched 2 lines. in the status bar and the printouts tab should be blank. (If your status bar says Test run complete but has a different number of lines don't worry about it; your test data and mine just aren't the same).

For a bit of comfort that all is working as expected, add a print line like this and you should get the same printouts shown.

And that's all there is to it. Now you'll never be stumped by the absence of Cool Function X, because you can write it yourself.

To be fair, though, while a simple function can be trivial, as we just saw, more complex functions can be... well, more complex. If you need help creating your awesome function don't hesitate to reach out. We'll be glad to help you get started.

Comparing Validatar to CsvPath Validation

David Kershaw — Wed, 28 Jan 2026 02:33:38 +0000

Let's go back to the buffet and compare CsvPath Framework and FlightPath Data to another validation tool. Today we'll look at Validatar. Once more unto the breach, dear friends!

As with the other comparisons, please remember that data quality tools like SodaCL, Great Expectations, or today's contestant, Validatar, only do data quality. CsvPath Framework, by contrast, is a data-file feeds management infrastructure that covers data validation as just one aspect of the full data preboarding lifecycle.

Moreover, CsvPath Framework does not deal with relational databases (other than as an option for storing its own metadata). Validatar et al. are first and foremost relational database quality management tools, and only secondarily deal with data files. So it's a mismatch, to some degree, but useful and entertaining nonetheless.


The Validatar example we'll replicate using CsvPath Validation Language and FlightPath Data is at https://docs.validatar.com/docs/exercise-10-create-a-uniqueness-monitoring-template-for-csv-files. Just from the name of the exercise you know this is more up CsvPath Framework's alley than it is Validatar's.

The Validatar Example

Here's the problem description at the top of the Validatar example:

This Standard Test is designed to demonstrate the concept of how to create a uniqueness template test for all CSV files in multiple folders.
The Standard Test here compares the Row Count per account_id value in the account_data.csv to make sure all account_id's only have 1 row. The test only keeps failures and stops after 100 failure records.

Spoiler alert: in FlightPath this is a trivial example (as I think it is meant to be)

Validatar starts by having you create a test template. Before you can do that, though, you need a project. Here are the instructions for that step:

Choose your project
Make sure your Data Source is Mapped correctly

The first bullet sounds simple. I'm not sure what the second bullet means because I'm not a Validatar expert and that one isn't explained on that page.

Moving on, let's create that template. Most of this exercise is forms based. The setup is shown in the image below. Their ask is that you notice:

Note that the column specified to group by is account_id
Note that it is comparing the ROW_COUNT to a fixed value of 1
Note that the Result Configuration is set so that only Failures are kept and to abort after 100 failures are found

Good requirements for us to use on the CsvPath side.

At this point we have our test. Now we need to create a template from it so that we can apply it to each CSV file. This is how we get to a single action we can apply to multiple files in a uniform way.

I'm going to just add the bullets because the screenshot is in the link above, which is of course a more complete description. We do:

Click Build Template
Update the Folder input to {{schema.name}}
Update the File input to {{table.name}}
Update the Column input to {{#replace table.name "_data.csv" "_id"}}
Update the Metadata Links to {{schema.name}}.{{table.name}}
Change the Generate column list using to Dynamic Template Configuration
Update the Dynamic Script

The dynamic script is pretty simple:

    [
        {"name":"{{#replace table.name "_data.csv" "_id"}}","sequence":1,"type":"Numeric","role":"Key"},
        {"name":"ROW_COUNT","sequence":2,"type":"Numeric","role":"Value"}
    ]

Now, we're going to use some metadata to filter down to the files we care about.

Switch to the Metadata Selection Tab
Change to the Use Filters option
Add a Filter on the Table Name Field that contains "_data.csv"

At this point, check that the filter finds your files and run the example. You should be good to go. My feeling is that it all works better for database tables than for CSV files, just as you would expect from Validatar.

Now, CsvPath Framework

Once more, this time with feeling! Let's see how CsvPath Framework and FlightPath Data can make the same magic happen. And, hopefully you'll agree that it's much simpler and more powerful for its use case.

The requirements, again

Create a uniqueness test for all csv files in multiple folders
The column specified to group by is account_id
Only keep failures
Stop after 100 failures

The core of these requirements is the validation statement. Using CsvPath Validation Language this is next to trivial:

$[*][ 
    @duplicate_accounts.nocontrib == 100 -> stop()
    has_dups(#account_id) -> counter.duplicate_accounts(1) 
]

(The @ sign means a variable and the # sign indicates a header name)

This csvpath says: for each line in a file check if the counter is 100. If it is, stop processing that file. Otherwise, increase the counter if the #account_id is a duplicate.

The statement will collect only error lines because:

The counter is a side-effect with no contribution to matching
The check if @duplicate_accounts equals 100 is marked to not contribute to matching. (Using the nocontrib qualifier)

The function that does the heavy lifting is has_dups(). If that returns True (i.e. the value of true()) we match the line and capture it.
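To compare notes with how you might hand-roll this, here is the same logic sketched in plain Python. It is not how has_dups() is implemented; it assumes a duplicate is flagged on its second and later appearances, as a streaming check would do, and the sample data and function name are hypothetical.

```python
import csv
import io

def find_duplicates(rows, key="account_id", max_failures=100):
    """Yield rows whose key value was already seen, stopping after max_failures."""
    seen = set()
    failures = 0
    for row in rows:
        if failures == max_failures:
            break  # mirrors the stop() rule
        value = row[key]
        if value in seen:
            failures += 1
            yield row  # only failures are kept
        else:
            seen.add(value)

# Hypothetical sample data; the real exercise uses account_data.csv.
sample = "account_id,name\n1,Ann\n2,Bob\n1,Ann again\n3,Cy\n2,Bob again\n"

dups = list(find_duplicates(csv.DictReader(io.StringIO(sample))))
capped = list(find_duplicates(csv.DictReader(io.StringIO(sample)), max_failures=1))

print([d["name"] for d in dups])
print(len(capped))
```

A dozen lines of Python versus two lines of csvpath, and the Python version still has no run management, versioning, or results capture.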

All pretty readable. Now what do we do with it?

FlightPath Data FTW

All of what we need to do is almost as simple in Python using only CsvPath Framework. Almost! But using FlightPath Data it is even simpler.

In FlightPath, create a new file called dups.csvpath. Paste in our statement.

Right-click on dups.csvpath and select Load csvpaths.

In the load dialog give the named-paths group the name dups and click Create.

You should see your csvpath show up in the middle window on the right under the dups folder. When you load a csvpath statement it always goes into a group.csvpaths file. And when you click on that file its background is pale green to let you know you cannot edit it. (You can, of course, overwrite it anytime without losing prior versions, but that is another topic for a different post.)

Next stage your data. In the example, each file is in its own folder and its folder is one of many in the same directory. We'll just add the parent folder and let FlightPath find the files for us.

To do that right click the parent directory and select Stage data. In the stage data dialog uncheck the Separate named-files checkbox because we're going to have every physical file be one version of the same named-file. Think of a named-file as a category that has one file assigned to it at a time, in sequence. We say named-files have versions.

In the named-file name box type accounts. That's our category. You will see your data in the top-right window as a directory named accounts. I used a template of :6/:filename in order to keep the month folders, but that is completely optional.

The result looks like:

Finally, right click the accounts folder, or the dups folder below it, and select New run. In the run dialog, for named-paths select dups. For named-file type in $accounts.files.:all. That named-file name is a reference that indicates every version of the accounts named-file. Again remember, the named-file is like a category that registers a file at a time. We registered a bunch of files and now we're applying our CsvPath Validation Language statement to each of them in turn.

And here's the Run dialog:

When you click Run you will see your results in the lower right-hand window. Your run is date stamped within the dups results. In your date-stamped run you can see the data.csv where your duplicate lines landed. In this image I dropped each run into its own folder using a template; you can see the test and test2. That is completely optional, of course.

And that is it!

There is, of course, much more you can do with CsvPath Framework. Likewise, Validatar has a ton more functionality than what we showed. But now you've had a taste of both.

What I'd hope you come away with is that CsvPath Framework is the better tool for CSV, JSONL, and Excel file validation. The ease of using FlightPath Data for this validation example makes the case well. Obviously, for relational database validation, Validatar is your horse.

And of course I also want to point out again that CsvPath Framework is a complete data preboarding solution, not just a validation engine. Preboarding inbound data files is a big deal. If you need that (and who doesn't?) you owe it to yourself to take a look at CsvPath Framework.

To whet your appetite, here's a post on data preboarding build or buy. Enjoy!

JSONL is a seriously weird format!

David Kershaw — Thu, 15 Jan 2026 02:56:45 +0000

CsvPath Framework has had JSONL support for a while. It wasn't until we updated FlightPath Data to edit JSONL that I really started noticing how seriously free-wheeling JSONL is. At least in the context of a certain kind of developer tool, JSONL is even weirder than CSV.

What Wouldn't JSONL Do?

Here are a few of the quirks we pondered:

There is no header order
Headers change line-by-line...
...or is there one total set of headers that grows line-by-line?
Arrays and dictionary lines can cohabitate
Header values (using CsvPath terms; a.k.a. dictionary values) can be complex objects
Array lines have headers that equal their values...
...or do array lines just not have headers?
...or are array indexes associated with the most recent headers?
With all this header (a.k.a. column) uncertainty you may have to read the whole file to know what the shape of the data is so that you have a reference point for when you check the lines.

All of this and more!
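You can see several of these quirks with a few lines of plain Python. This is a sketch with hypothetical data, using only the standard library; it surveys each line's shape and tracks the ever-growing set of headers.

```python
import json

# Hypothetical JSONL mixing dictionary lines and an array line.
jsonl = """{"id": 1, "name": "a"}
{"name": "b", "tags": ["x", "y"], "id": 2}
[3, "c"]
{"id": 4, "extra": {"nested": true}}"""

all_headers = []  # union of keys, in first-seen order
shapes = []       # one summary per line

for line in jsonl.splitlines():
    value = json.loads(line)
    if isinstance(value, dict):
        for key in value:
            if key not in all_headers:
                all_headers.append(key)
        shapes.append(("dict", list(value)))
    elif isinstance(value, list):
        shapes.append(("array", len(value)))
    else:
        shapes.append(("scalar", None))

print(all_headers)
for shape in shapes:
    print(shape)
```

Note that you only know the full header set after the last line has been read, which is the "read the whole file" problem in a nutshell.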

The Visualization Is Where They Get You

And it only gets more nuanced and nutty when you try to make a grid view editor. A simple editor has no useful way to show dictionary-line data. The compromises are a big deal. You can choose to make the headers an ever growing set and then sort them and display the data in the grid according to the sorted header name position, thereby allowing you to serialize the data back to JSONL. But that is a lot of choices that add up to a very particular way of handling the data that might not work for everyone.
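Here's a minimal sketch of the sorted-header approach, to show how many decisions hide in it. It is not FlightPath's code; it assumes every line is a dictionary line, and the function names are made up for the example.

```python
import json

def to_grid(jsonl_text):
    """Lay dictionary lines out as rows under a sorted, ever-growing header set."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    headers = sorted({key for rec in records for key in rec})
    # A missing key becomes an empty cell (None).
    rows = [[rec.get(h) for h in headers] for rec in records]
    return headers, rows

def to_jsonl(headers, rows):
    """Serialize grid rows back to dictionary lines, dropping empty cells."""
    return "\n".join(
        json.dumps({h: v for h, v in zip(headers, row) if v is not None})
        for row in rows
    )

text = '{"b": 2, "a": 1}\n{"c": 3}'
headers, rows = to_grid(text)
round_trip = to_jsonl(headers, rows)

print(headers)
print(rows)
print(round_trip)
```

Even this toy version quietly reorders keys and cannot tell a missing key from an explicit null. Those are the kinds of compromises I mean.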

We split the difference. FlightPath shows JSONL data in a grid view. However, it does not attempt to show any headers. And if you edit the data in the grid view you have to save it as CSV. Since that's a lot of compromise, we made it possible to use the JSON editing view for JSONL. That gives you syntax highlighting, pretty printing (JSONL style!), and full control of the data, but at the cost of the greater productivity you would have had in a grid view.

So Why JSONL?

JSONL is great for data that is less tabular or fluctuates even more wildly than CSV does. It gives cleaner programmatic access to the contents of each line. In most cases, JSONL's key-based access is self-documenting at the line level, albeit leaving the file as a whole a bit harder to understand, and it keeps you from having to access data through indexes. And while it is more verbose for many purposes, JSONL can potentially save significant space in certain wide and sparse use cases.

All that said, if you're a stickler for correctness, control, and referenceable validity, JSONL might give you hives. Even more than CSV. And that's saying something.

If you're feeling the need for an antidote to crazy JSONL, CSV, and Excel data, check out CsvPath Framework and FlightPath Data.

Is CsvPath an easy or hard language?

David Kershaw — Sun, 14 Dec 2025 05:34:54 +0000

CsvPath. What? Why? How?

CsvPath Framework includes CsvPath Validation Language. To save scarce ASCII, let's call it CVL.

CVL is a document-oriented tabular data validation language like XSD, Schematron, DDL, and JSONQuery. You can use CVL to validate CSV, Excel, and JSONL files, as well as Pandas-type data frames in memory. The purpose of CVL within CsvPath Framework is to be the validation and upgrading capability within the larger data preboarding architecture.

CsvPath Validation Language is line-based. That means it validates streaming line-by-line, rather than looking at a complete set of data all at once. All of the other languages referenced above are full-set-at-a-time languages, not line-by-line. That means that, in principle, they trade memory size for speed. In practice the picture is more nuanced. DDL in particular typically has strategies that allow it to handle data larger than RAM. CVL being line-by-line also means it can do some things that the full data set languages cannot (easily) do and it cannot do some things (easily) that they can do. In both cases the runtimes they run within can compensate for limitations, at least to some degree.

Also worth pointing out: CVL and all of the others mentioned are declarative languages. By and large, you don't program them to follow a logical algorithm determined by you. Rather you declare what should happen or what you are testing and let the runtime figure out how to give you the desired results.

CVL is both schema-based and rules-based. Its schema definition capabilities are most like SQL's DDL. Its rules capabilities are most like Schematron and SQL's queries. CVL is an intentionally simplistic language, at least in grammatical terms. However, that is like saying the Chinese language is grammatically simple. Yes, but have you seen the characters?

In CVL's case, while the challenge isn't at the magnitude of tens of thousands of Chinese characters, the set of functions is still substantial. CVL has hundreds of functions that can be used to create rules that go far beyond schemas into business logic validation. Knowing which functions are the best ones to use for a particular purpose is perhaps the biggest challenge. And on top of that are a host of qualifiers, modes, and other power toys.

Why do we need a tabular data validation language? Because tabular data, CSV especially, is widespread, is way too flexible in ways that hurt productivity and add bottom-line risk, and is frequently under the control of external data partners, not internal staff. Moreover, the lowest cost point, by far, to find and fix data problems is as close to the source as possible. Every step data takes inward from the enterprise's edge magnifies the blast radius of errors dramatically. And every time you need to rewind a process or replay a transformation, your cost and risk go up like skyrockets. 🧨

A Practical Question

So now we have a sense for what CsvPath Validation Language is and why it is important. A practical next question is: what is it going to cost me in time and toil to learn to use it?

The answer is relative. If your goal is to validate the data types of a data frame now and then, CVL is probably too much trouble compared to hacking up a Python script, especially if you are already comfortable with Pandas or Polars. CVL was not made for one-off validations, although it can certainly do them well enough.

But if your goal is to run a data collection operation involving tens, hundreds, or thousands of data partners and document types, CVL and CsvPath Framework are vital. Compared to the challenges of:

designing and building your own framework,
handling every partnership as a one-off unique project,
managing the codebase to control drift and tech debt,
maintaining knowledge as team members change over time,
compensating for other languages' lack of applicability to the tabular data file problems you face,
firefighting constantly with one arm tied behind your back,
etc., etc.

Compared to all that, CVL is an absolute walk in the park.

We would argue that CVL is an easy language to use, easy to learn, capable in most situations it was built for, and only about as capable of helping you shoot yourself in the foot as SQL. Which is to say, it's not hard to blow your foot off, but it's quite doable to just not. And, and this is important, CsvPath Validation Language is built for automation, not ad hoc scripting. We expect that you'll treat CVL as code, use iterative dev and tests, create it methodically with an eye on simplicity and reuse, version it, etc. CVL will treat you as well as you treat it, like any other serious language.

So, after all that, is CVL easy to learn and use? It depends on your perspective.

A (Relatively) Simple Example

I'm writing this because I ran into some CVL that was quite straightforward, but also had a lot of small things to remember. That's the "it's easy if you know it" side of so many things.

Here's the scenario; it may be familiar to some of you.

We have a data partner in the retail space who sends us a CSV of orders on a weekly or monthly basis. We want to validate files as they enter the enterprise, before they are accepted as being basically good data. (Downstream our application may reject some orders, but its more exacting business rules are not pertinent to preboarding raw data as it arrives; we're not trying to reinvent the app, just protect it from crap).

Here is the sample data:

ID,date,time,store,address,city,state,zip,category,type,shelf,vendor,product name,UPC,SKU,unit,quantity,a price
03358993,03/21/2024,10:24:14,Bob's store,1 Lakeshore Drive, Chicago, IL, 33581,OFFICE,PAPER,1-5,Sams Paper,20lbs Ream,0301024855,,per each,8,20.99
03358994,03/21/2024,10:31:28,Fred's store,1 Lakeshore Drive, Chicago, IL, 33581,OPERA,WRITING,1-5,Biz Pen,10-pack Black,0541931855,0432950078,per each,2,4
03358995,03/21/2024,11:26:18,Mary's story, 1 Lakeshore Drive, Chicago, IL, 33581,FOOD,CANDY,7,Starbursts,Single,3583900656,0899920453,per each,1,1.29d

We want to catch several things here using rules. Creating a schema would be a different exercise than what I was doing. Here we want to catch:

Bad prices
Wrong categories
Missing SKU (stock-keeping units)
Missing UPC (universal product codes)

This is one of the exercises on https://www.csvpath.org, but simplified a bit.

The way I tackled these rules was to create this csvpath. In the real world, I probably would have chosen to create multiple csvpaths, 1 per rule, and run them as a named-paths group. But here I just whipped up this one-off csvpath. You can tell it's a one-off because it includes the path to a file in its root. I.e. ${FILE}.

~
  validation-mode:no-raise, no-stop, print, no-fail, collect
  logic-mode:OR
~
${FILE}[1*][

~ 1x wrong, 2nd item ~
@in = in( #category, "OFFICE|COMPUTING|FURNITURE|PRINT|FOOD|OTHER" )
not( @in.asbool ) ->
    error.category( "Bad category $.headers.category at line $.csvpath.count_lines ", fail())

~ 2x wrong, 2nd and 3rd items ~
@price_format = exact( end(), /\$?\d*\.\d{2}/ )
not( @price_format.asbool ) ->
    error.price("Bad price $.headers.'a price' at line  $.csvpath.count_lines", fail())

~ 1x missing, 1st item ~
#SKU
not( #SKU ) ->
    error.sku("No SKU at line $.csvpath.count_lines", fail())

~ always exists ~
#UPC
not( #UPC ) ->
    error.upc("No UPC at line $.csvpath.count_lines", fail())
]
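As a cross-check on what those four rules should catch, here is the same intent sketched in plain Python against the sample data. This is an illustration, not CsvPath: it assumes the pipe-separated string is a list of allowed categories, writes the price pattern as an ordinary Python regular expression, and numbers data lines from 1 because the header is line 0.

```python
import csv
import io
import re

data = """ID,date,time,store,address,city,state,zip,category,type,shelf,vendor,product name,UPC,SKU,unit,quantity,a price
03358993,03/21/2024,10:24:14,Bob's store,1 Lakeshore Drive, Chicago, IL, 33581,OFFICE,PAPER,1-5,Sams Paper,20lbs Ream,0301024855,,per each,8,20.99
03358994,03/21/2024,10:31:28,Fred's store,1 Lakeshore Drive, Chicago, IL, 33581,OPERA,WRITING,1-5,Biz Pen,10-pack Black,0541931855,0432950078,per each,2,4
03358995,03/21/2024,11:26:18,Mary's story, 1 Lakeshore Drive, Chicago, IL, 33581,FOOD,CANDY,7,Starbursts,Single,3583900656,0899920453,per each,1,1.29d"""

CATEGORIES = {"OFFICE", "COMPUTING", "FURNITURE", "PRINT", "FOOD", "OTHER"}
PRICE = re.compile(r"\$?\d*\.\d{2}")  # optional $, digits, a dot, exactly two digits

def check(row):
    """Return the names of the rules a row breaks."""
    errors = []
    if row["category"] not in CATEGORIES:
        errors.append("category")
    if not PRICE.fullmatch(row["a price"]):
        errors.append("price")
    if not row["SKU"].strip():
        errors.append("sku")
    if not row["UPC"].strip():
        errors.append("upc")
    return errors

reader = csv.DictReader(io.StringIO(data))
# Line 0 is the header line, so data lines start at 1.
report = {n: check(row) for n, row in enumerate(reader, start=1)}

print(report)
```

The result lines up with the comments in the csvpath: one bad category, two bad prices, one missing SKU, and no missing UPCs.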

These are the things I found myself jotting down as I went. They are not all-new or unfamiliar. At first I just wanted to remember the changes I was making. Then I started thinking about what a new CVL user would have to know.

THINGS TO PAY ATTENTION TO:
* using OR logic
* the importance of validation-mode
* scanning instructions and headers
* variable existence tests and the asbool qualifier
* side-effects vs. match determiners
* assignments don't determine matches
* header existence tests
* add names to errors, at least here (how id_chains work)
* what is fail()?
* use " on headers in body, use ' on headers in print() and error()

10 things

Yeah, it's a lot. But does that make CVL hard? Is the sum of the parts powerful? Again, depends on your point of view. I'd say not super hard, but it takes practice, same as any language. And, yes, powerful, if you need it; otherwise, it could be overkill, as I said above.

Let's go through the items.

Using OR logic

The example uses logic-mode to switch from the default ANDing to ORing. Modes are ways of configuring a single csvpath to behave in a certain way without changing your CsvPath Framework project's global config. You set modes in comments at the top of a csvpath. Comments are demarcated with ~ characters.

AND is the default. That means that within the "matching part" of a csvpath each declared unit, called a match component, is ANDed together to determine if a line matches the csvpath statement. (Read about AND and OR here). That means all match components must evaluate to true (in Python terms True) for the line to match.

In this case, we're going to use OR by setting logic-mode: OR. The decision is somewhat arbitrary for the exercise. We could have used AND with only small changes to the csvpath, but the starting point for my work on this example was already OR so I left it.

OR means that each match component has the opportunity to determine whether the line matches. In AND all must be true. In OR all it takes is one true for the line to match. This means we're using an alternate validation strategy. In OR we're saying "good looks like one of these things"; whereas, in AND we're saying "good requires all of these things".
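In Python terms the difference is simply all() versus any(). A tiny sketch, with a made-up function name, assuming each match component has already been evaluated to True or False:

```python
def line_matches(component_results, logic_mode="AND"):
    """Combine per-component results the way the two logic modes are described."""
    if logic_mode == "AND":
        return all(component_results)
    if logic_mode == "OR":
        return any(component_results)
    raise ValueError(f"unknown logic mode: {logic_mode}")

# One of three match components evaluated to True for this line.
results = [True, False, False]

print(line_matches(results, "AND"))  # False
print(line_matches(results, "OR"))   # True
```

The same line fails under AND and matches under OR, which is why the choice of mode changes how you phrase your rules.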

The importance of `validation-mode`

Another mode, validation-mode, determines what happens when we hit an error. validation-mode controls the action that happens when the Framework runs into a validation or language error. A validation error is a data error. It may be your own custom rule or it may be built-in. A language error is when you try to do something that is syntactically wrong or uses a match component incorrectly.

validation-mode has the following options when an error happens:

Raise an exception, resulting in the run terminating noisily
Print the error to the default printstream (or you can choose another printstream if the error is from your own custom rule)
Stop the run cleanly
Fail the run, but keep going
Collect the error for later programmatic review and/or in a JSON file

You can negate these options by adding no- as a prefix to the option. E.g. no-raise. And you can select multiple options at once by separating them with a comma. E.g.

   validation-mode:no-raise, print, fail

The most important thing to remember about validation-mode is that raise will halt runs and print stack traces; whereas, no-raise will keep going and errors may be harder to detect. no-raise + no-print is generally a bad option, at least in dev, because you will have trouble telling what happened when you hit an error. The second most important thing is that if print is turned off project wide, you lose a lot of information, and in that case using validation-mode to turn print on is often a good idea. However, printing is expensive. When you do a lot of printing the run slows down considerably, so if you expect a lot of errors you may want to just collect them silently and review that JSON after the run, rather than printing a ton of lines that may slow a large file down too much.
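To make the no- prefix and the comma-separated list concrete, here is a small Python sketch of how a setting like the one above could be read into on/off flags. The parse_validation_mode function is hypothetical and is not the Framework's parser; it only models the syntax described here.

```python
def parse_validation_mode(setting: str) -> dict:
    """Toy parser: turn 'no-raise, print, fail' into a dict of on/off flags."""
    flags = {}
    for token in setting.split(","):
        token = token.strip().lower()
        if not token:
            continue
        if token.startswith("no-"):
            flags[token[3:]] = False  # the no- prefix negates the option
        else:
            flags[token] = True
    return flags


flags = parse_validation_mode("no-raise, print, fail")
print(flags)
```

Read this way, the example setting says: don't raise, do print, and do fail the run.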

Scanning instructions and headers

This is an easy one. The "scanning part" of a csvpath is the first set of brackets. Often it looks like: [*]. Scanning instructions can pick out certain lines to validate and skip over other lines. One of the most common use cases is to use [1*] to say that we want to scan from line 1 to the end of the file. This means the 2nd line to the end, because line numbers are 0-based. What this does is skip over the header line. The header line is, initially, always line 0. You can reset the headers any time, but that first line is the typical location.

If you know you have a header line you probably don't want to validate it. Let's parse that statement. You may want to validate that the headers are, say, in a certain order, all-caps, from a certain list of names, etc. But you don't have to do that on the 0th line. You can declare those rules and they will take effect, just as long as any one or more lines are considered for matching. At the same time, while your header rules won't trouble other data lines, the reverse may not be true. If you expect the item count header to have an integer that you want to sum, you probably don't want to see an error on line 0 because item count is not an integer.
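Here is a rough Python analogy for what [1*] buys you. The sample data and the scan_from helper are made up for illustration; the point is only that starting at line 1 keeps the header text out of the integer sum.

```python
import csv
import io

# Made-up sample data; line 0 is the header line.
data = "item,item count\npencil,3\nstapler,2\n"
rows = list(csv.reader(io.StringIO(data)))


def scan_from(lines, start):
    """Yield (line_number, line) pairs from a 0-based start to the end, like [start*]."""
    for number, line in enumerate(lines):
        if number >= start:
            yield number, line


count_index = rows[0].index("item count")
# Starting at line 1 skips the header, so int() never sees the text "item count".
total = sum(int(line[count_index]) for _, line in scan_from(rows, 1))
print(total)
```

If the scan started at line 0 instead, int("item count") would raise a ValueError, which is the Python cousin of the line 0 error described above.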

Variable existence tests vs the `asbool` qualifier

When you have a variable all by itself it is a test for the existence of the variable. Variables are words with an @ in front, @firstname. A variable is created the first time it is used. But saying @firstname by itself doesn't create the variable, rather it returns true if @firstname exists, otherwise false.

This can be confusing if you have a variable with a boolean value where you want the variable to vote on the line matching. We have that in the example:

    not( @price_format.asbool )

If the price format is False we want to do something. But, if we just do:

    @price_format

We aren't voting to match based on the correctness of the price format. We're instead voting to match based on the existence of the price format variable. In order to vote with the value of the variable you have to use the asbool qualifier. A qualifier modifies how a match component behaves. In this case we're saying "pay attention to my value, not to if I exist".
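In Python terms the difference is roughly "is the key in the dict?" versus "is the value truthy?". This sketch is an analogy under my own assumptions: the variables dict and the exists and asbool helpers are hypothetical, and the string handling in asbool is a guess at reasonable behavior rather than the Framework's actual conversion rules.

```python
# Hypothetical stand-in for a csvpath's variables.
variables = {"price_format": False}


def exists(name):
    """Models a bare @name: votes on whether the variable exists at all."""
    return name in variables


def asbool(name):
    """Models @name.asbool: votes with the variable's value (conversion rules assumed)."""
    value = variables.get(name)
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


existence_vote = exists("price_format")  # the variable exists, so this votes to match
value_vote = asbool("price_format")      # the value is False, so this votes not to match
rule_fires = not value_vote              # models not( @price_format.asbool )
print(existence_vote, value_vote, rule_fires)
```

The existence test happily votes to match even though the price format is bad, which is exactly the trap.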

Side-effects vs. match determiners

At a high level, a match component can do one of three things:

Determine if lines match (a.k.a. vote on matching)
Return a calculated value
Do some other thing that doesn't provide a vote or a value

The last of these is called a side-effect. An obvious example is print(). You can print without voting on a line matching or not. And print() doesn't return a value that can be used in an assignment or existence test.

An assignment is a side-effect. For example:

@in = in( #category, "OFFICE|COMPUTING|FURNITURE|PRINT|FOOD|OTHER" )

This match component is an assignment of the value produced by in() to the variable @in. You might think that either or both sides of a when/do expression would be the same. A when/do is an expression where a left-side match component is separated from a right-side match component by a -> operator. When/do is CVL's if/then statement. This guess would be reasonable, but wrong. A when/do operation allows both sides to vote. That means that:

    not( @price_format.asbool ) ->
    error.price("Bad price $.headers.'a price' at line  $.csvpath.count_lines", fail())

The left-hand side is voting to match or not based on the boolean value of the @price_format variable. At the same time, error.price() is not voting because an error() is a side-effect. To be more specific, not() both generates a value and also votes on matches; error() does neither.

With hundreds of functions, CVL demands a quick function lookup with detailed documentation. Luckily you get that in CsvPath Framework's CLI (just do poetry run cli) and/or look in FlightPath Data in the language helper window that opens on the right when you are editing a csvpath file.

Assignments don't determine matches

We've just covered this point-to-remember above, but it bears repeating: an @x = "true" expression does not vote on if a line matches. In this case we only set @x to the value "true", nothing more. We can subsequently use @x to vote using either its existence:

@x

or its value:

   @x.asbool

To (hopefully!) be over-clear, this csvpath:

   @x = "true" @x @x.asbool

has 3 match components and will always match every line because not only does @x exist, but @x is also true, which means @x.asbool evaluates to a Python True under the hood based on the "true" value of @x.

Header existence tests

Similar to the point I just made above. A header existence test is like #SKU. That means the #SKU header evaluates to true when a line has a value for SKU; otherwise, false. In the example, we aren't checking what the value is, we only care that there is a value. If we needed the value of #SKU to be treated as a boolean we would again use the asbool qualifier, but in this case we do not need that.

The example uses OR logic and we want to collect all the lines, valid or not, so I have both the rule predicate not(#SKU) and the existence test #SKU. The first determines if we get a validation error message and collects lines that are missing SKUs, and the second collects the line if there is a SKU, so that we get all the lines.

An improvement to the example might be to remove the two match components #SKU and #UPC and instead add yes(). yes() always votes to match. In OR we only need one positive vote, so yes() collects every line. The reason this would be an improvement is that the single yes() replaces two match components, and fewer is generally better. And, because, as written, we collect all lines because of the SKU and UPC match components, but we don't follow the same pattern with price and category. It would be nice to use the same pattern for all four rules, even if either the SKU or UPC match components collect all the lines for everyone. Using a yes() separates the line collection from the validation rules and makes all the rules behave the same, while at the same time simplifying the csvpath a little.

Add names to errors

It's important to remember that print() and error() are essentially the same thing, with one super important difference: error() creates error events. Both functions print messages to printstreams, optionally using print references to include potentially detailed metadata. But only error() throws off JSON error events that are captured to the errors.json and available programmatically. And the error messages you create using error() are printed in error-formatted lines.

An error-formatted line is like log output that is formatted using a log template. You can set the error template in config.ini or the config panel in FlightPath Data. Your config can choose to print "bare" errors or "full". A bare error doesn't use the error template at all; instead, the message is the whole output to the printstream. The default error message template is:

{time}:{file}:{line}:{paths}:{instance}:{chain}:  {message}
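The template reads a lot like a Python format string, so here is a quick sketch that fills it in that way. This assumes, purely for illustration, that the fields substitute like str.format placeholders; the message text is a placeholder of mine and the other values are borrowed from the example run in this post.

```python
# The default template from above, treated as a Python format string (an assumption).
template = "{time}:{file}:{line}:{paths}:{instance}:{chain}:  {message}"

rendered = template.format(
    time="2025-12-14 04h18m59s-936643",
    file="March-2024.csv",
    line=11,
    paths="",      # empty in this example
    instance="",   # empty in this example
    chain="category",
    message="Bad category",  # placeholder message text
)
print(rendered)
```

Empty fields simply leave their colons behind, which is why a prefix can contain a run of colons before the chain.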

Now comes the point. If you add names to your error() functions, like

error.price("Bad price $.headers.price at line $.csvpath.count_lines")

You get a full-format error message that has a clear ID chain. (Assuming your template includes {chain}.) What does this mean?

An ID chain is the path-like identifier of each match component, scoped within an expression. Each match component in your csvpaths is, potentially, composed of other match components in a tree structure. Each top-level match component is considered an expression. (The component structure is held in an Expression object, even if we rarely speak of it). $[*][#a #b @c] is a csvpath with 3 expressions, each of which has one component. $[*][add(#a,#b,@c)] is a csvpath with 1 expression that has 4 components.

Within each expression the name-by-name path, parent to child, leading to a specific match component is its ID chain. In an error line the prefix before the message might look like this:

    2025-12-14 04h18m59s-936643:March-2024.csv:11:::category

Here category is what we declared as:

    not( @in.asbool ) ->
       error.category( "Bad category $.headers.category at line $.csvpath.count_lines ", fail())

However, this simple example doesn't show the power of the ID chain. Let's make a contrived example that helps better show it. If we change the error category match component to nest another child match component, while removing the category name on the error(), like this:

    not( @in.asbool ) ->
       error( "Bad category $.headers.category at line $.csvpath.count_lines ", error("just testing", fail()))

The error message we get from the nested error is this:

    2025-12-14 04h20m04s-975094:March-2024.csv:11:::error[1].error[1]

That says that the first error child of an error raised an error event. We can see what that means because it's a simple example. But since ID chains are always scoped to the expression, not the csvpath statement as a whole, what happens when we have very similar error checks for SKUs and UPCs? error[1].error[1] doesn't feel so informative. And these rules are simple; in more complex situations ID chains are both more important and, potentially, harder to read.

Just adding the names to the functions makes all the difference to interpreting where the error occurred. Let's change the expression to:

    not( @in.asbool ) ->
       error.category( "Bad category $.headers.category at line $.csvpath.count_lines ", error.testing-idchains("just testing", fail()))

The error message of the 2nd error() becomes:

2025-12-14 04h33m20s-635277:March-2024.csv:11:::category.testing-idchains

That's much more readable! Now our UPC, SKU, price, and category errors are much clearer, making error investigation just a bit less frustrating and a bit quicker.

What is `fail()`?

There are many approaches to validation using any competent validation language in any non-trivial use case. CsvPath Validation Language is no exception, and in fact, due to the quirks of CSV and tabular data file processing CVL has more options than most. Among these, the big ones are:

Collect matching lines as the valid data
Collect matching lines as the invalid data
Print or error() to indicate errors, without collecting data or making a ruling on the file as a whole
Pass or fail a data file as a whole based on what rules, schemas, or built-in validations passed or failed

The last option, marking the data file invalid based on rules and/or schemas, relies on the fail() function and the settings for error handling in config.ini and validation-mode to set the run's is_valid field. is_valid is visible and/or accessible in:

Run metadata JSON
Print and error function references (e.g. $.csvpath.valid)
The valid() and failed() functions
Programmatically on the CsvPath and Result objects

Depending on your config and validation-mode setting, if any, built-in errors can automatically fail a run. In some cases, though, you may want to be more deliberate in failing your runs based on certain rules, but allowing others to just be informational. For instance, you might want to flag a date format error, but then simply upgrade the data, rather than failing the run for a very correctable problem. At the same time, if the date was empty but expected, you might want to use fail() to mark the run and data file invalid because you found an uncorrectable, unignorable problem.

Use `"` on headers in body, use `'` on headers in `print()` and `error()`

And finally, an easy one! We use #"my header name with spaces" to refer to a header named my header name with spaces. At least, that's how we do it in the world of header match components. Ah, but the world of print() and error() messages is a bit different. You see the problem, right? A print or error message is wrapped in quotes, so it's not possible for a header reference to use quotes.

Instead, within the print or error message we simply use single quotes for header names that need to be quoted. This really isn't a big ask. Here is our example:

#"a price"
not( @price_format.asbool ) ->
    error.price("Bad price $.headers.'a price' at line  $.csvpath.count_lines", fail())

Here we are testing if the #"a price" header exists; presumably to make sure we match lines with prices. Then we're raising an error if the @price_format variable is False. In the error() message we refer to the same header as $.headers.'a price'. Not too much to ask. But it is one more one-little-thing to remember.

So, What Is CsvPath Validation Language: Easy? Hard?

I hope this post helps call out some of the knowledge that can help you create effective validation and upgrading scripts. There is, of course, much more to unpack. But the real question here isn't how much or how little, how hard or how easy. The real question is, is it worth it? And the test of that for many people is: does the solution make easy stuff easy and hard stuff possible?

By that measure, CsvPath Validation Language and the whole CsvPath Framework score well.

How Many Ways Can A CSV File Hold Multiple Entities?

David Kershaw — Sat, 13 Dec 2025 05:36:49 +0000

Focus On the Structure

CSV, Excel, and JSONL need more validation love

Tabular data can be handled with as much rigor, quality, and efficiency as document-form, containment models (e.g. JSONSchema, XSD) or table-based data (e.g. SQL). And it should be!

As one example, CsvPath Validation Language can accommodate multiple entities living in a single tabular document using a sophisticated schema syntax as capable as DDL. In fact, it can do this at least four different ways. Each of them is common enough in the wild.

Mixed parent-child relationships by line adjacency and order
Multiple entities' data lines grouped one entity after another
Entities side-by-side, line-by-line
Entities in sub-table-like clusters organized visually and floating in a tabular landscape (e.g. Excel files with ancillary table/boxes for sub-calculations, dimensions, etc.)

The image above outlines each of these. Can you think of more ways to position multiple entities in a tabular CSV, Excel, or JSONL data file?

Have a look at this article on CSV validation schemas to see the syntax and some approaches used in the validation part of CsvPath Framework's data preboarding architecture.

SQL: Doing GROUP BY in CsvPath

David Kershaw — Thu, 04 Dec 2025 01:28:14 +0000

Let's look at how to create a simple GROUP BY report in CsvPath Framework's tabular data validation language. Of all our examples, this is an easy one!

A GROUP BY query is straightforward. It selects rows and groups them according to one or more columns. The archetypal example is:

SELECT  
  dept, 
  role, 
  SUM(salary) total_salary
FROM    employee
GROUP BY dept, role

This query produces a result set with three columns. The first two are a unique combination of the dept and role columns and the third is a summation of the salaries in the rows with that combination of department and role.
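If SQL isn't your first language, the same grouping can be sketched in a few lines of Python. The employee rows here are made-up sample data, not from any real table.

```python
from collections import defaultdict

# Made-up rows standing in for the employee table.
employees = [
    {"dept": "eng", "role": "dev", "salary": 100},
    {"dept": "eng", "role": "dev", "salary": 120},
    {"dept": "eng", "role": "qa", "salary": 90},
    {"dept": "ops", "role": "dev", "salary": 110},
]

# GROUP BY dept, role with SUM(salary): key on the column pair, accumulate the sum.
total_salary = defaultdict(int)
for row in employees:
    total_salary[(row["dept"], row["role"])] += row["salary"]

for (dept, role), total in sorted(total_salary.items()):
    print(dept, role, total)
```

Each unique (dept, role) pair becomes one output row, with the salaries for that pair summed.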

For our example we'll use FlightPath Data's built-in examples because they are handy. Every time you create a project FlightPath creates an examples folder with a directory tree of useful how-to examples. In this case we'll use examples/counting/projects.csv as the data and create a new csvpath for our work. To create your new csvpaths file just right-click on the counting directory and select New file.

What we need is to replicate the calculation in the SQL and at the end of the data file output a table. That's the easiest way to make an analog to SQL's GROUP BY. For this example we prefer to output at the end of the run because CsvPath works line-by-line and we don't care about intermediate results.

As an aside, a SQL database will likely not have to do a table scan for the GROUP BY, so the database will have a speed advantage in large data situations. CsvPath is built for automation and typically does somewhat different kinds of processing than most SQL, including things that require the line-by-line approach. In most cases, unless you're working with gigabytes of raw data the real-time performance is fine. Either way, once automated you'll be off doing something else.

First scaffold your csvpath. FlightPath may have already done this for you.

~ test-data:examples/counting/projects.csv ~
$[*][

]

So far, this says scan the whole file (*) and do nothing. The test file is indicated in test-data. In production that directive will be ignored, but within FlightPath Data it keeps us from having to select a data file for each test run.

Next let's do the simplest thing:

~ test-data:examples/counting/projects.csv ~
$[*][
    subtotal.worker_hours(#agency, #13)
    last() -> var_table("worker_hours")
]

These two lines say that we want to subtotal the values in header #13 (0-based) for each different value of the #agency header. Header #13 is a.k.a. #worker_hours_this_period. Sometimes you need to use a header's index number for a practical reason. In this case #13 is just less typing. We're giving the subtotal variable a name that is meaningful to us. That could help if we were going to use subtotal() more than once in a csvpath.

Then we have the last() function. last() matches only the very last line in the data file. When last() matches we are going to do whatever is on the right-hand side of the ->. That symbol, ->, is the when/do operator. It says, when a thing is true, do this other thing. It is CsvPath Validation Language's version of the if/then statement.

What we want to do is to output a table of workers' hours by agency. We have the data in a variable called @worker_hours that was created by subtotal(). Variables are addressed using the @, just like header values are addressed using the # sign.

Many functions generate data in variables, often so you can use it right there within the csvpath. In FlightPath Data you can see your test run's variables in the Variables tab. In production you would see your variables in the run's directory in vars.json.

So then, on the last line our csvpath will call the var_table() function, passing it the worker_hours variable by name. This function is part of the print functions group, if you're looking at the help window.

var_table() outputs a text table using whatever variable you give it. In our case, a dictionary structure of tracked values. This is the table we see:

Pretty neat! And simple, only two lines.
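For readers who think in Python, the line-by-line approach behind those two lines can be sketched like this. It is an analogy only: the sample data is made up (it is not the contents of projects.csv), and the dict stands in for the @worker_hours variable.

```python
import csv
import io

# Made-up sample data; the real example file has many more headers.
data = "agency,worker_hours_this_period\nParks,10\nTransit,5\nParks,2.5\n"
rows = list(csv.DictReader(io.StringIO(data)))

worker_hours = {}  # plays the role of the @worker_hours variable
for number, row in enumerate(rows):
    # Like subtotal(): keep a running sum per distinct agency value.
    agency = row["agency"]
    hours = float(row["worker_hours_this_period"])
    worker_hours[agency] = worker_hours.get(agency, 0.0) + hours

    # Like last() -> var_table(): only act once the final line is reached.
    if number == len(rows) - 1:
        for name, subtotal in worker_hours.items():
            print(f"{name:<10} {subtotal:>8.1f}")
```

Nothing is printed until the final line, which mirrors the choice to output only at the end of the run.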

However, our canonical GROUP BY example at the top used two fields, not just one. Let's do the same here.

This time we're going to create a key for our subtotal. Our key uses the columns we're interested in and sums the values of the workers' hours accordingly.

We're adding just:

    @key = concat(#agency, ", ", #neighborhood)
    subtotal.Aggregate_hours(@key, #13)
    last() -> var_table("Aggregate_hours")

You can see there isn't much difference. The key is all we need to distinguish the lines we want to group. There are other ways to achieve this outcome, but this is a simple one. The result looks like this:

This output isn't super helpful for automated production testing. However, the data that goes to vars.json would be. Nevertheless, this is a useful validation in another sense. Often times data managers are looking for a "report-form" validation. I.e. they want written complaints about the data, if there are any problems. The ISO Schematron standard for XML validation is an example of this approach. It makes it easy to associate XPath statements with human readable user-defined errors.

In the case of CsvPath, as I said above, our focus, first and foremost, is on lights-out automation. To that end, we offer the errors.json file to collect built-in validation errors and the error() function for csvpath writers to create their own custom errors feeding into errors.json.

However, CsvPath also has a sophisticated printouts subsystem that allows you to print to multiple printstreams from within your csvpath. That lets you be informational, display errors, and separate different run information for different purposes and/or audiences that may have different validation needs. There's lots to dig into in that vein.

For now, though, we have what we came for: a shift-left printout validation upstream of the database doing the same thing as GROUP BY but closer to the source of data errors where fixes cost less time, money, and hair-pulling. Not bad at all.

If you want to learn more about CsvPath Framework and tabular data automation check out https://www.csvpath.org. For a closer look at the validation language I'd suggest grabbing a (free!) copy of FlightPath Data and looking at the examples. The Github repo is also helpful.

Comparing Great Expectations and CsvPath Framework

David Kershaw — Sun, 30 Nov 2025 06:46:55 +0000

Today we're going to take a swing at translating a Great Expectations Python script into CsvPath Framework. The script comes from the GE documentation site. First a bit about the tools.

Great Expectations is a data quality tool for live production pipeline data checking. You know, the thing we all know we should be doing but mostly aren't. GE comes as a core expectations library and a paid SaaS service that adds teamwork and visualization. The expectations are essentially data quality rules realized in Python functions. Each function contains the logic of the test and the hooks for it to work within the framework's larger context.

Like CsvPath Framework, GE brings together data sources, data rules, and context to generate metadata. However, Great Expectations is a quality control toolkit, without any of the data management tooling CsvPath brings. CsvPath Framework is for preboarding tabular data files as they enter the enterprise. Where GE leans towards the relational database world, CsvPath is fully focused on edge governance of file feeds. Both tools capture metadata and throw off validation events. GE stays largely within its SaaS service. CsvPath instead supports the widely adopted OTLP and OpenLineage protocols.

Other differences include the validation approach. Great Expectations stays in the world of Python. It does not have a schema language or support one, other than SQL. Most of the lifting is done by the packaged and custom "expectations" functions. CsvPath Framework, on the other hand, is a full architecture for preboarding that has a core competency in data quality and data upgrading. It has a schema language for tabular data, CsvPath Validation Language, that is analogous to XSD, DDL, Schematron, or JSON Schema. CSV and Excel finally have someone taking them seriously.

Two Takes On Schema Validation

The example we're going to look at is from https://docs.greatexpectations.io/docs/reference/learn/data_quality_use_cases/schema#strict-vs-relaxed-schema-validation

It gives a validation script that does four things:

A "strict" validation of the columns of a single table in an RDBMS
A "relaxed" validation of the columns of the same table
One common type check
Another rule for variation in allowed types in a column

While this is a relational database example, Great Expectations' forte, it is a valid comparison since GE only checks one table and in ways that would apply equally well to an incoming tabular file. CsvPath Framework can use databases to store its metadata, but it doesn't validate database data.

Setup Boilerplate

Great Expectations' script has a lot of boilerplate. As you might expect with the modest list above, the setup code is the largest part.

In contrast, CsvPath Framework offers FlightPath Data as a no-setup-required GUI environment and/or FlightPath Server as a set of very lightweight JSON REST endpoints that minimize setup boilerplate down to webhook-able simplicity. There is also a CLI that makes running a CsvPath script a no-setup thing.

However, for the comparison I'll give the equivalent Python setup. First the Great Expectations:

import great_expectations as gx
import great_expectations.expectations as gxe

context = gx.get_context()
# Create Data Source, Data Asset, and Batch Definition.
# CONNECTION_STRING contains the connection string for the Postgres database.
datasource = context.data_sources.add_postgres(
    "postgres database", connection_string=CONNECTION_STRING
)

data_asset = datasource.add_table_asset(name="data asset", table_name="transfers")

batch_definition = data_asset.add_batch_definition_whole_table("batch definition")

batch = batch_definition.get_batch()

# Create Expectation Suite with strict type and column Expectations. Validate data.
strict_suite = context.suites.add(gx.ExpectationSuite(name="strict checks"))

GE's own comments are probably sufficient to explain how the library's pieces are being marshaled prior to the validation run.

Now CsvPath Framework's analog:

from csvpath import CsvPaths
paths = CsvPaths()
paths.file_manager.add_named_file(name="transfers", path="s3://mybucket/2025-may-transfers.csv")
paths.paths_manager.add_named_paths_from_file(name="transfers", file_path="scripts/transfers.csvpaths")

What we did was:

We created an instance of CsvPaths, the class that runs sets of scripts against versions of files
Next we added a physical file as a new version of a logical file named "transfers". I'm pulling it from s3, but it could be in any of the Framework's storage backends.
Then we added a set of one or more csvpath statements as a named set of validations in a local file.

That's not a lot of setup. Granted, you can do a lot more when you need to; there are many options. But also remember that even that small amount of Python is optional. FlightPath Data, FlightPath Server, and the CsvPath CLI are here for you.

That's it, both Great Expectations and CsvPath Framework are now ready to validate.

The Validations

We'll show two validations, both variations on a theme. Basically we check data order and the type of a column.

The first validation is a strict check on columns/headers. (I'll go into why CsvPath Framework refers to "headers" rather than "columns" another time; there's a good reason, but for now just go with it.) They must match a provided list. And the transfer_amount column/header must have double precision values. We're in the land of Python, so from that perspective we're talking about float values.

GE's Validation

This is Great Expectations' example, so let's look at what they are doing. Here's the code:

strict_suite.add_expectation(
    gxe.ExpectTableColumnsToMatchOrderedList(
        column_list=[
            "type",
            "sender_account_number",
            "recipient_fullname",
            "transfer_amount",
            "transfer_date",
        ]
    )
)
strict_suite.add_expectation(
    gxe.ExpectColumnValuesToBeOfType(column="transfer_amount", type_="DOUBLE PRECISION")
)
strict_results = batch.validate(strict_suite)

Simple enough.

CsvPath Framework's Validation

CsvPath does its validation and upgrading work in CsvPath Validation Language. The language is concise and function-specific. Interestingly, because the language is purpose-built it offers multiple ways to attack the problem. Let's do the most exact match to the GE version:

$[*][ 
header_names_match("type|sender_account_number|recipient_fullname|transfer_amount|transfer_date")
float(#transfer_amount)
]

We start with the scanning instruction. In this case we want to check all lines so we just pass *. Then come the functions in the matching part of the statement.

These two functions are called match components. They are ANDed together (by default, but if needed we can OR). If both evaluate to True the line being considered is a match. In our validation strategy, lines that match are valid. As you can probably tell, this is virtually an exact match to the GE solution.
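To spell out what the strict check amounts to, here is a plain Python sketch of the same two rules: headers must equal the ordered list, and transfer_amount must parse as a float. This is neither GE nor CsvPath code. The strict_check and is_float helpers and the sample data are hypothetical, and unlike the csvpath above the sketch simply skips the header line.

```python
import csv
import io

EXPECTED = ["type", "sender_account_number", "recipient_fullname",
            "transfer_amount", "transfer_date"]


def is_float(value):
    """True if the text parses as a Python float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def strict_check(text):
    """Return one True/False per data line: headers in exact order AND a float amount."""
    rows = list(csv.reader(io.StringIO(text)))
    headers, data_lines = rows[0], rows[1:]
    if headers != EXPECTED:
        return [False] * len(data_lines)
    index = headers.index("transfer_amount")
    return [len(line) > index and is_float(line[index]) for line in data_lines]


# Made-up sample data.
sample = (
    "type,sender_account_number,recipient_fullname,transfer_amount,transfer_date\n"
    "domestic,123,Ada Lovelace,10.50,2025-05-01\n"
    "domestic,456,Alan Turing,abc,2025-05-02\n"
)
strict_results = strict_check(sample)
print(strict_results)
```

The first data line passes both rules; the second fails because abc is not a float.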

Is it the best way, though? Honestly, it is fine. But personal preference weighs in. I have to give you the option I would take.

$[*][
   line(
      blank(#type),
      blank(#sender_account_number),
      string(#recipient_fullname),
      float(#transfer_amount),
      date(#transfer_date)
   )
]

As you can see, the # indicates a header, the CSV equivalent of a database column.

For me, that's a more readable structure. It also gives a bit more type information than we require; however, it's pretty easy to guess string and date for recipient_fullname and transfer_date, respectively. I also used the blank() type to assign the header names in positions where I couldn't guess the data types.

Next let's go back to Great Expectations. We're going to slightly update the first validation to do a more relaxed version:

relaxed_suite = context.suites.add(gx.ExpectationSuite(name="relaxed checks"))
relaxed_suite.add_expectation(
    gxe.ExpectTableColumnsToMatchSet(
        column_set=[
            "type",
            "sender_account_number",
            "transfer_amount",
            "transfer_date",
        ],
        exact_match=False,
    )
)
relaxed_suite.add_expectation(
    gxe.ExpectColumnValuesToBeInTypeList(
        column="transfer_amount", type_list=["DOUBLE PRECISION", "STRING"]
    )
)
relaxed_results = batch.validate(relaxed_suite)

Here we're allowing the columns to be in any order, but they all must be present and no additional columns may be added. We're also letting the transfer_amount column now be either a float or a string. Nothing complicated.

Let's look at the same in CsvPath.

$[*][ 
header_names_match.nocontrib.m("type|sender_account_number|recipient_fullname|transfer_amount|transfer_date")

sum(@m_present, @m_misordered) == count_headers()

or(
 float(#transfer_amount),
 string(#transfer_amount)
)
]

This time the CsvPath statement is a bit more verbose. Here's what's happening.

We use the header_names_match() function again. We don't care if there isn't a strictly ordered match so we add the nocontrib qualifier. Qualifiers modify the behavior of match components. In this case we're telling header_names_match() to not contribute to the determination of whether a line matches. We also add an m just to give a simpler name to the backing variables the function creates.

We use those backing variables, specifically @m_present and @m_misordered, to check that we have all the headers and no additional headers. header_names_match() also creates a count of unmatched headers and duplicated headers, but we don't need those.

Finally, we change the type declaration from float() to a logical structure that accepts either a float or a string. This doesn't work in the line() schema form, but it works great as a validation rule.
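The intent of the relaxed header rule, as described above, is "same names, any order, nothing extra". In plain Python that intent could be sketched as below. The relaxed_headers_ok helper is hypothetical and says nothing about how header_names_match() computes its backing variables internally.

```python
EXPECTED = ["type", "sender_account_number", "recipient_fullname",
            "transfer_amount", "transfer_date"]


def relaxed_headers_ok(headers, expected=EXPECTED):
    """True when the same names are present in any order, with no extras and none missing."""
    return sorted(headers) == sorted(expected)


shuffled_ok = relaxed_headers_ok(list(reversed(EXPECTED)))
extra_ok = relaxed_headers_ok(EXPECTED + ["memo"])
missing_ok = relaxed_headers_ok(EXPECTED[:-1])
print(shuffled_ok, extra_ok, missing_ok)
```

Reordering passes, while an extra or missing header fails.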

Results and Metadata

The last thing we want to check is... how did we do? Is our data valid?

Great Expectations does:

print(f"Strict validation passes: {strict_results['success']}")
print(f"Relaxed validation passes: {relaxed_results['success']}")

That works fine for purposes of example.

On the CsvPath Framework side, we have more choices to make. Generally we don't go with something like what the GE example shows, not when we're using the Framework to its fullest.

Let me back-track and say that we could have done a much simpler CsvPath run like this:

from csvpath import CsvPath
path = CsvPath()
path.fast_forward("""
$transfer.csv[*][
   line(
      blank(#type),
      blank(#sender_account_number),
      string(#recipient_fullname),
      float(#transfer_amount),
      date(#transfer_date)
   )
]
""")
print(f"Any errors? {path.has_errors}")

That's everything you need to match the GE example.

What I actually set up, though, was a more robust, preboarding automation-friendly harness. It was similar to what you'd use in production, and to what FlightPath Server does behind the scenes. In that world you work with Result objects. Results give much more metadata than you get from Great Expectations; oftentimes (hopefully!) more than you need.

Going back to the original way we set up CsvPath, accessing results to check validity and errors looks like:

from csvpath import CsvPaths

paths = CsvPaths()
paths.file_manager.add_named_file(name="transfers", path="s3://mybucket/2025-may-transfers.csv")
paths.paths_manager.add_named_paths_from_file(name="transfers", file_path="scripts/transfers.csvpaths")
ref = paths.fast_forward_paths(filename="transfers", pathsname="transfers")

results = paths.results_manager.get_named_results(ref)
for result in results:
    print(f"Csvpath has errors: {result.has_errors}, is valid: {result.is_valid}")

Right off the bat you're probably saying why.

Why are we iterating results? We iterate because we can execute multiple csvpath statements at a time. To do that we would load multiple csvpath statements under the name "transfers". We didn't go into how to do it. Suffice to say, the easiest way is to just put the statements in the same file separated by ---- CSVPATH ----.
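As a rough picture of why one file can yield several results, here is a small Python sketch that splits a file's text on that separator. The Framework does its own loading; split_statements is a hypothetical helper, not part of the csvpath API.

```python
SEPARATOR = "---- CSVPATH ----"


def split_statements(text):
    """Split the text of a csvpaths file into individual statements.

    Hypothetical helper for illustration; empty chunks are dropped.
    """
    return [part.strip() for part in text.split(SEPARATOR) if part.strip()]
```

Two statements in the file means two items in the list, and in a real run, two results to iterate.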

And why do we check for both errors and validity? Because validation errors are just one possible mark of invalidity. CsvPath Framework considers itself a data preboarding system for files in general, but it takes flat-file validation super seriously, mainly because no one else does. Using CsvPath Validation Language simply is easy, as I hope we have shown. You can, of course, go much further to use it in sophisticated ways that are miles beyond the scope of this post.

Errors and the is_valid result overlap but are not identical. To "fail" a file you can call fail() which makes is_valid equal False. Validation errors can also automatically set is_valid to False, but that is a configuration choice, not the default. In some cases you might want to instead match on incorrect lines and return them. If you did that your file might be considered invalid if more than 0 lines were returned. In some cases, you might want to take a more Schematron-like approach and simply print built-in and custom error messages as a kind of validation report, rather than relying on a single boolean. There are many options. We're just scratching the surface. Whatever you're trying to do, CsvPath Framework has it covered.
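Here is one way to picture how those signals relate, as a plain Python sketch. RunOutcome, run_is_valid(), and no_bad_lines() are hypothetical names I made up for illustration; they are not Framework classes or settings.

```python
from dataclasses import dataclass


@dataclass
class RunOutcome:
    """Hypothetical summary of one csvpath's run."""
    error_count: int = 0     # validation errors reported during the run
    failed: bool = False     # True if the csvpath called fail()
    matched_lines: int = 0   # lines the csvpath matched and collected


def run_is_valid(outcome, errors_invalidate=False):
    """fail() always invalidates; errors invalidate only if configured to."""
    if outcome.failed:
        return False
    if errors_invalidate and outcome.error_count > 0:
        return False
    return True


def no_bad_lines(outcome):
    """Policy for a csvpath written to match incorrect lines."""
    return outcome.matched_lines == 0
```

With the default, a run with errors still counts as valid unless fail() was called, which is why the example prints both values.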

All that said, we don't need to over-complicate things. We can keep this simple.

Net, Net, We Have Validated

Ultimately both Great Expectations and CsvPath Framework validated the data with ease. GE has the advantage with the RDBMS, of course, and CsvPath Framework provides the muscle on the data preboarding side of things. I hope this post convinces you that both tools are worth a closer look!

Comparing CsvPath and SodaCL

David Kershaw — Fri, 28 Nov 2025 22:10:28 +0000

Let's have some more fun with comparing and contrasting schema languages. In this post we'll look at a schemas + rules-based validation tool, Soda, vis-a-vis CsvPath Framework's CsvPath Validation Language.

SodaCL is the validation rules language for the Soda data quality library. You can learn more at soda.io. I'll say right up front that this is an apples-to-oranges comparison. Here's why:

Soda is mainly relational data focused; CsvPath Framework is mainly files focused
Soda is a data quality tool; whereas, CsvPath Framework is for data preboarding, which includes data quality, but isn't limited to it
SodaCL is a domain-specific language built on YAML that uses embedded SQL; CsvPath Validation Language is a first-class stand-alone validation language, more similar in that regard to DDL or XSD

Nevertheless, both tools do flat-file validation, so it is an apt comparison point. As much so as apples and oranges, both being tasty fruit.

Let's grab a first example from the SodaCL docs and see where it takes us. Please note that these are quick and dirty comparisons. I'm not building the SodaCL or CsvPath for perfection, just giving a rough sense of the differences and similarities.

Duplicates

Here's a duplicate rows query check in SodaCL. Even though it is a SQL check it only checks one table so it seems fair game for comparison to a tabular data file.

checks for dim_product:
  - failed rows:
      fail query: |
        with duplicated_records as (
          select
            {{ column_a }},
            {{ column_b }}
          from {{ table }}
          group by {{ column_a }}, {{ column_b }}
          having count(*) > 1
        )
        select
          q.*
        from {{ table }} q
        join duplicated_records dup
          on q.{{ column_a }} = dup.{{ column_a }}
          and q.{{ column_b }} = dup.{{ column_b }}

This test finds and returns rows that have a common column A + column B. In other words, column A with column B act as a meaningful identity, and if we find a duplicate we found an error. As we're using a SELECT the result is the set of every row that has a duplicate row.

In CsvPath we would prefer to do something a bit simpler:

$[*][
    has_dups(#a, #b)
]

This does almost the exact same thing. The result is the duplicate lines, but not the original lines. An original line is the first line of a set of duplicates.
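If the "duplicates but not originals" behavior is hard to picture, this Python sketch shows the idea with a set of identities already seen. It only illustrates the behavior described above; it is not has_dups() itself, and line numbers here simply count data rows from 0.

```python
def duplicate_lines(rows, keys):
    """Return line numbers of rows whose key values were already seen.

    The first row with a given identity is the "original" and is not returned.
    """
    seen = set()
    dups = []
    for line_number, row in enumerate(rows):
        identity = tuple(row[k] for k in keys)
        if identity in seen:
            dups.append(line_number)
        else:
            seen.add(identity)
    return dups
```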

If we want to know all lines with duplicates, whether or not they are the original line, we can use dup_lines(). This function returns all the line numbers that are duplicated, including the first. That would net us a variable named @dup_lines (or whatever we want it to be named).

The variable would contain a key for every unique value holding a list of line numbers. In order to get the actual lines we would need a second CsvPath that uses the duplicate lines variable to return all the duplicate lines.

$[*][
   dup_lines.lines(#a, #b)
   no()
]
---- CSVPATH ----
$[*][
   @s = get(
         $dups.variables.lines,
         fingerprint(#a, #b)
        )
   @t = size(@s)
   above(@t, 1)
   append("line", line_number(), yes())
]

Here the first csvpath creates a variable, @lines, that has all the unique a+b header value fingerprints as keys to stacks of line numbers where that fingerprint was found. The no() keeps lines from being collected, since we don't need them.

We load these two csvpaths as a single named-paths group called dups. Running the named-paths group serially makes sure the first csvpath has prepared the data that the second needs before the second starts.

If @lines holds a stack of more than one line number for the current line's fingerprint, the line has duplicates. Because above() returns true the line matches and is collected. QED.

In our example there is no ID that distinguishes lines. In a real case, you might want to have the line numbers so you can better investigate why there were duplicates. As you can see, the dup_lines() function captures that for you in a variable. The variable is available programmatically and in the vars.json file generated by the run.

However, to stay closer to our working csvpath, we can just add the line number to the lines captured. To do that, we append a new header, line, giving it the value from line_number(). The yes() says that we want the name line added to the header line as well.
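The two-pass idea can be sketched in plain Python. This is an analogy, not the Framework's code: the fingerprint() here is a stand-in that hashes the joined values (the real function's algorithm is not something I'm claiming), and line numbers count data rows from 0.

```python
import hashlib
from collections import defaultdict


def fingerprint(*values):
    """Stand-in fingerprint: hash of the values joined by a separator."""
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def index_lines(rows, keys):
    """Pass one: map each fingerprint to the line numbers where it appears."""
    lines = defaultdict(list)
    for line_number, row in enumerate(rows):
        lines[fingerprint(*(row[k] for k in keys))].append(line_number)
    return dict(lines)


def lines_with_duplicates(rows, keys, lines):
    """Pass two: keep rows whose fingerprint has more than one line number."""
    collected = []
    for line_number, row in enumerate(rows):
        stack = lines.get(fingerprint(*(row[k] for k in keys)), [])
        if len(stack) > 1:
            # Add the line number as an extra "line" value on the row
            collected.append({**row, "line": line_number})
    return collected
```

Pass one has to finish before pass two starts, which is the same reason the named-paths group runs serially.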

To strip the CsvPath solution back to essentials that match the SodaCL we get:

$[*][
   dup_lines.lines(#a, #b)
]
---- CSVPATH ----
$[*][
   @s = get(
         $dups.variables.lines,
         fingerprint(#a, #b)
        )
   above(size(@s), 1)
]

That's a nice concise pair of csvpaths. If we were preboarding data with more rules, we would add these two statements to a larger named-paths group covering all the validation rules.

You can see that for data preboarding, CsvPath Framework's purpose-built CSV validation capabilities are on target. SodaCL, while not a preboarding tool, is also highly effective and obviously a better choice for monitoring the data quality of database-housed data downstream of CsvPath. There's more we can compare between CsvPath and SodaCL. We'll return to it in a future post.

Let's say you have a data lake

David Kershaw — Thu, 27 Nov 2025 16:06:11 +0000

Say you have a data lake. How do you get known-good CSV or Excel files into it?

What About Transfer Mode?

CsvPath Framework has a few options:

Locate the archive on the local disk, within the lake
Use a backend to stream to the lake during processing
Use the SFTP integration to forward files after processing is done
Use transfer-mode to forward files after processing

All good options. Each serves particular use cases, though with quite a bit of overlap.

Local processing is of course the fastest, so making CsvPath Framework essentially be the bronze layer of a local data lake is common. In this scenario, the archive(s) are the part of the lake that handle data files.

Alternatively, any remote backend (s3://, azure://, gs://, sftp://) can receive data as it is generated. That is a notably slower option because of network latency. This is not a CsvPath Framework issue; every tool has it. In many cases, however, speed is either not an issue because the data size is moderate (MB, not GB) or because some other concern overrides write time. And, of course, lights-out automation makes slower easier to take.

There is also an SFTP integration that allows you to forward files to an SFTP server. Why is that different from using the SFTP backend? It is not much. It allows you to populate the archive and also copy output to another location. This can be useful for returning files to data partners for their checks without providing access to the archive. There may be other similar use cases.

This article is about the fourth option: transfer-mode.

Speed With Flexibility

Transfer mode is a Framework-native mode that allows you to transfer data to any backend. (Framework-native basically means it is always available; there's no turning it off like an integration; practically speaking, this means it is not under the control of an admin or project lead; any csvpath writer can use it).

How is that different from just using a backend? It's different because transfer-mode is a post processing step, not a streaming-during-processing step. That means it is potentially much faster and can go somewhere outside the archive. The location doesn't need to be a data.csv or unmatched.csv; you can name the file anything you like.

In addition, transfer-mode permits something very un-CsvPath Framework: it allows you to merge data. CsvPath permits a lot of things in validation and upgrading. It does not, generally, permit any ETL-like stuff. And for the most part, ETL tools are not great at validation (though they tend to be Ok in upgrading scenarios). We pick our battles. And we solve for preboarding, which ETL tools to a good approximation never do.

But with transfer mode you can pack two data files into one. And this can be incredibly helpful when you need a quick and dirty join and you don't want to unpack the ETL to do it.

When might you want that? A good example might be running a named-paths group in serial with simple, single-rule csvpaths that need to aggregate data. Basically an OR across csvpaths, allowing for rules that are easier to test. There are, of course, other possible scenarios.

How Does Transfer Mode Work?

Like this: you add the transfer mode metadata tag to an external comment and a variable indicating the path to transfer to. It might look like:

~
    id:output to data lake
    transfer-mode: data > out+
~

Here the ID is output to data lake. That is just useful documentation and helpful in validation print statements. Transfer mode is directing CsvPath Framework to push the data.csv output to the location specified by the @out variable. The trailing + indicates the data.csv should be appended to any already existing file at that location. If we wanted to send data to another place we would just add another data-angle-variable statement after the one you see, separated by a comma.
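To spell out how that value reads, here is a small Python sketch that breaks a transfer-mode value into its parts. The Framework has its own parsing; parse_transfer_mode and the Transfer tuple are hypothetical names, and using unmatched as a second source in the usage note is my assumption.

```python
from typing import NamedTuple


class Transfer(NamedTuple):
    source: str      # which output to send, e.g. "data"
    variable: str    # name of the variable holding the destination path
    append: bool     # True when a trailing + asks for appending


def parse_transfer_mode(value):
    """Parse a value like 'data > out+' into Transfer tuples."""
    transfers = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        source, arrow, target = part.partition(">")
        if not arrow:
            raise ValueError(f"expected 'source > variable' but got: {part!r}")
        target = target.strip()
        append = target.endswith("+")
        transfers.append(Transfer(source.strip(), target.rstrip("+").strip(), append))
    return transfers
```

For example, parse_transfer_mode("data > out+, unmatched > rejects") gives one appending transfer to @out and one plain transfer to @rejects.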

It's that simple. Here's a version of the same with the body of a csvpath attached.

~
    id: boston freds
    transfer-mode: data > lake+
    source-mode: preceding
~
$[*][
    line(
      string(#firstname,25,1),
      string(#city,35,5),
      integer.notnone(#zip,5),
    )

    lower(#firstname) == "fred"
    lower(#city) == "boston"

    @lake = "s3://all-freds-in/cities.csv"
]

A Stranger In a New Town: CsvPath metadata fields

David Kershaw — Tue, 25 Nov 2025 02:27:34 +0000

The horse half died by the time we came off the highland. That's why I don't name my horses. My boots had holes. My six was dry. Nothing in my pockets but metadata. What's a guy gotta do to get a drink in this town?

Metadata is the wild west.

A great example that goes beyond data is tags. A lot of you have probably noticed in AWS that tags are amazing. Amazingly hard to use well for any sizable project, let alone the enterprise. Same with the other cloud providers.

Metadata is supposed to drive everything, ideally from metadata catalogs. But just you try figuring out how to capture consistent metadata across all your systems automatically so that they are up to date and consistent. Then you sit back and think, ok, I got that, so now what should I capture and how should I use it and how do I know I can trust it.

Sorry partner, I'm not here for that.

I'm laying out that prologue just to highlight that it's a deep subject and ornery to operationalize. You may not have gotten a lead line on CsvPath Framework's small contribution yet. Let's take a look. It's a narrow vista of a much larger landscape, but manageable and super useful.

Types of Metadata

There are three types of metadata in CsvPath Framework:

Framework generated
User configured
User defined

Framework generated metadata is everything collected in the act of:

Staging data in named-files
Loading csvpath statements in named-paths groups
Running named-files against named-paths groups

Each of these activities results in metadata that is minimally captured to JSON files. Most of it can also be sent to a relational database and/or observability platform. The JSON files are manifest.json for Framework mechanics and meta.json for runtime generated metadata. It's the latter that I'm going to focus on here.

User-configured metadata and user-defined metadata come from the CsvPaths themselves. They are located in "external comments". An external comment is one that is above (or less commonly below) the body of a csvpath statement. In files with multiple csvpaths separated by ---- CSVPATH ---- external comments live between the csvpaths.

User-configured metadata are, primarily, the modes, along with some integrations-specific fields known only by the integration that allows them. A mode is one of 11 settings that can be applied on a csvpath-by-csvpath basis. Modes do things like:

Determine how validation is handled
Set the logical operator used to combine match components
Switch on collecting of unmatched lines

And a bunch of other useful things. Modes are built into the Framework. They are helpful in understanding why the results you get are what you got.

User-defined metadata are fields that you, the csvpath writer, create to document your data, and potentially to trigger behavior in other systems. A user-defined field looks like a word with a colon after it. This is a metadata field:

description: this csvpath validates order files

The meaning is pretty clear: we're creating a description field and setting its value to this csvpath validates order files.

CsvPath Framework Tags

The Framework doesn't use the word "tag" today. Neither does FlightPath Data. That's one reason for me to drop this post. Tags are super helpful, but since we call them user-defined metadata fields, a lot of words, and then don't talk about them much, they are probably under-used.

So I'll just call them tags.

When you create a tag, it is free text. Any one word followed by a colon creates a tag. The value of the tag runs until the next word-with-a-colon is seen. You can also stop a tag by just a stand-alone colon. That can be useful if you prefer to put your tags above a narrative description of the csvpath.

For example:

    ~ 
      copyright: © atesta analytics
      author: William Blake
      test-data: examples/schemas/example-one.csv
      : This csvpath shows how metadata is created alongside
      documentation in external comments. It is just a quick example.
    ~

Here we created two metadata fields, "copyright" and "author", as well as using a well-known instruction for FlightPath ("test-data") and adding some documentation.
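The tag rules are simple enough to sketch in a few lines of Python. This is only my reading of the rules above (a word plus colon starts a tag, the value runs to the next one, a stand-alone colon ends it), not the Framework's parser, and parse_tags is a hypothetical name.

```python
import re

# A tag marker is a word (letters, digits, _ or -) followed by a colon,
# or a stand-alone colon, set off by whitespace on both sides.
TAG_MARKER = re.compile(r"(?<!\S)([\w-]*):(?!\S)")


def parse_tags(comment):
    """Return {tag: value} for the text of an external comment."""
    tags = {}
    markers = list(TAG_MARKER.finditer(comment))
    for i, marker in enumerate(markers):
        name = marker.group(1)
        if not name:
            continue  # stand-alone colon: ends the previous tag, starts none
        end = markers[i + 1].start() if i + 1 < len(markers) else len(comment)
        # Collapse line breaks and runs of spaces inside the value
        tags[name] = " ".join(comment[marker.end():end].split())
    return tags
```

Run on the comment above, this sketch yields copyright, author, and test-data, and leaves the narrative after the stand-alone colon out of every tag. A value like s3://bucket is not mistaken for a tag because its colon is not followed by whitespace.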

When we run our csvpath against a CSV we get something like this:

You can see the two fields we created. The test-data field for FlightPath is there. (Though in the run that created the screenshot we didn't use it.) Print mode was added by the Framework in the background. Any modes we used explicitly would also show here. And you can see the entire original comment for context.

Simple! And in many ways very similar to AWS or JIRA or any other system that offers tags-based organization.

Now, what do we do with these metadata field tags? Well, one obvious thing to do is to document our csvpaths and the data they validate and/or upgrade. This means capturing the world of a csvpath's run as:

Narrative docs
Framework generated metadata
User-defined tags

Some things in CsvPath Framework are clear enough at a technical level from metadata you don't have to define yourself. For example, you know what named-file and named-paths group were used in every run. But you don't know who the data belongs to. Even if the name of the named-file, or the path within the named-file that gets you the data file bytes, gives you an indication, you might not know for sure, and your downstream system almost certainly has a different viewpoint.

We can add some more tags:

Using FlightPath Data we see the metadata flow into a run's meta.json:

And now we have the opportunity to pull that run's metadata from the archive using FlightPath Server's API:

How you use this feature is of course up to you. While you can annotate your schemas and rules with inline comments, a good use for user-defined metadata fields is to say more about what each part of a schema means.

For example:

    ~
       User schema for the Wild West Order Management application.

       username: this username is controlled by the SSO
       firstname: a free optional field. middle names can go here if needed.
       family_name: not optional. we expect a single name, possibly hyphenated. we're not the system of record. the name should match SSO lastname.

        validation-mode: print, raise
    ~
    $[*][
       line.user.distinct(
          string.username.notnone(#0, 35, 8),
          string.firstname(#firstname, 40),
          string.lastname.notnone(#family_name, 55)
       )
    ]

Clearly there's a ton of documentation here for a tiny schema, as well as several hard constraints. We can easily find a way to convey basically everything about the data file that you'd want to know, and that's before looking at the Framework-defined metadata and runtime metrics data. That means, if we go all out on the metadata, we have a lot of choices to make.

The potential for the tag-o-sphere to become a mess is high, of course. The good news is that CsvPath Framework and FlightPath are not intended to be a metadata catalog. They can and should feed a catalog and/or stand ready to serve data based on metadata fields known by other systems. But you don't typically browse the archive as a metadata repository like you might with OpenMetadata, DataHub, or Secoda. CsvPath Framework is a producer system, not a consumer system. It will tell you what you need to know in great detail, but unlike those other systems, CsvPath doesn't offer you all the things you don't know.

Again, this is powerful stuff. You can safely ignore user-defined metadata if you choose. But as your data operations expand and mature, you have an awesome opportunity to add a huge amount of clarity for downstream users through producing the right metadata. And it's easy to do.

Not bad for a one-horse town.

A real-world example of CsvPath schemas

David Kershaw — Fri, 14 Nov 2025 01:31:05 +0000

Zuora's example invoice upload CSV came to my attention in a timely way. Just last night I posted this comparison of SQL and CsvPath schemas. CSV schemas are an interesting topic for anyone who likes to geek-out about document and data validation. (👋 Raises hand...) but are they practical?

As it turns out, yes. And a good example of where you might use them comes from the Zuora docs: Guidelines_for_CSV_file_upload_in_Data_Loader

This file format combines invoices with invoice items. That's a very common thing to do. It is about as tasteful as a wildly denormalized SQL table, but practical, for sure. How might we validate an invoice upload file? We would use CsvPath schemas. Could we use just plain CsvPath rules? Sure, we could. But as you'll see, line() schemas are cleaner.

If you're playing along at home, pop open FlightPath Data. For those of you who are new, FlightPath Data is CsvPath Framework's favorite frontend. It is available free on the Windows store and the Apple macOS app store.

Take a look at the Zuora example file. In FlightPath Data it looks like:

It starts with three empty lines. If you switch to plain text it's even clearer. The lines have delimiters but no values.

This is our first challenge. Let's say that we can accept the leading blank lines without calling the file invalid. How does our CsvPath have to accommodate that? The answer is one of those "easy if you know it" things.

In this case, we're going to assume that we'll typically have blank lines. The accommodation/fix is:

    empty.nocontrib(headers()) -> skip()
    after_blank.nocontrib() -> reset_headers()

These two match components say that if a line is empty we skip it. We put the nocontrib qualifier there to say that we don't want emptiness to be a validation criterion. In other words, the empty() doesn't contribute to the match calculation. The skip() makes CsvPath jump to the next line without matching the current line and without checking any of the match components below it.

Then we have to find the headers. Technically we don't need to do this, because we can access header values using indexes. E.g. #0 is the same as #IsNewInvoice. However, the names are easier to work with. To find the headers we simply check if the previous line was a blank, and if so, reset the header names to be the values in that line. And again, we don't want the after_blank() function to determine if a line matches. Easy!
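For anyone who thinks better in Python, here is a rough sketch of the same skip-then-reset idea using the standard csv module. It is an analogy, not what the Framework does internally; in particular, treating the header line as headers only and not as a data line is my simplification.

```python
import csv
import io


def rows_after_blank_lines(text):
    """Skip blank lines and take the first line after a blank as the headers."""
    headers = None
    previous_was_blank = False
    records = []
    for values in csv.reader(io.StringIO(text)):
        # A line with delimiters but no values counts as blank
        if all(v.strip() == "" for v in values):
            previous_was_blank = True
            continue
        if previous_was_blank or headers is None:
            headers = values  # like reset_headers()
            previous_was_blank = False
            continue
        records.append(dict(zip(headers, values)))
    return headers, records
```

Once the headers are reset, values can be reached by name rather than by index.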

Now the fun part. We need to define two schema entities using the line() function. One is for the invoices. The other is for each invoice's line-items. Interestingly, and again commonly for this form of CSV, the invoice lines are cohabitated with the first invoice item. That is no trouble at all.

The invoice entity

A new invoice is indicated by a TRUE value in the #0 header. When we see that we need to match:

   #0 == "TRUE" -> line.invoice(
       blank(),
       string(#"Account Id"),
       string(#"Invoice Date"),
       string(#"Auto Pay"),
       string(#comments),
       string(#"Invoice Number"),
       string(#"Invoice.PO Number"),
       wildcard()
   )

This entity skips the first header, accepts strings for the next six headers, and adds a wildcard() to allow any number of additional values to exist to the right.

When you run this, you get two matching lines. With our validation strategy, these two lines are valid. Everything else in the file is skipped or invalid. We're doing well. We have invoices. But we still need invoice items. Or, I should say, we have invoice items but they are incorrectly being found to be invalid. We'll fix that.

For our second entity, the invoice items, we make another line(). Keep in mind that a line() is always a complete description of a single line. But at the same time it is also as specific as you like, and so may not consider all header values.

The way line() can model both a whole line and an entity within a line is by using blank() and wildcard() to block out the parts of a line that it doesn't care about. We do this in the invoice entity in the first header position because we don't care about the signal IsNewInvoice. Here, in the invoice item entity, we don't care about the invoice, so we'll skip over that with wildcard().

   #IsNewInvoiceItem == "TRUE" -> line.item(
       wildcard(8),
       string(#"Invoice Item Amount"),
       datetime(#"Invoice Item Service Start Date", "%Y-%m-%d %H:%M:%S"),
       string(#"Invoice Item Charge Date"),
       string(#"Invoice Item Charge Name"),
       string(#"Invoice Item Description"),
       string(#"Invoice Item Quantity"),
       string(#"Invoice Item Service End Date"),
       string(#"Invoice Item Unit Price"),
       string(#"IsNewInvoiceItemTaxItem"),
       string(#"Invoice Item Tax Item Tax Code"),
       string(#"Invoice Item Tax Item Tax Mode"),
       string(#"Invoice Item Tax Date")
   )

These are very simple entities without much typing or constraints. In a real invoice validation csvpath statement you would probably use types, constraints, and qualifiers, and add rules outside the line(). But here we're only roughing out what is possible.

This looks good! It should work, right?

Well, not exactly. If we run this we will get the two invoices, but not the invoice items. Why is that?

The reason is we're missing a nocontrib qualifier. We are matching the two invoices just fine. But when we don't match on an invoice we are considering the line to be invalid. That means that we only get one invoice item per invoice -- the one that is on the same line as the parent invoice. But clearly that's not what we want.

Since we always have a first invoice item on an invoice's line, all we need to do is declare that there being an invoice, or not, doesn't determine if a line matches. I.e., the invoice entity doesn't contribute to the match. And we already saw how to do that with the nocontrib qualifier. We just add another to the invoice test on the first column like this:

    #0.nocontrib == "TRUE" -> line.invoice(

Now when you run the csvpath, you get all the invoices and all their invoice items in the matched lines collected. You have validated your file, and as a bonus, removed the extraneous empty lines at the top.

Easy, simple, and if you're a validation geek like me, really cool looking! Have fun with this CsvPath Validation Language schema stuff. If you get stuck, leave a comment and I'll be happy to help out.