Forem: Wincent Balin

Closure

Wincent Balin — Sat, 05 Aug 2023 06:08:34 +0000

After a pause, this series comes to a conclusion, mostly because of the rapid developments in the area of large language models.

Original intention

At the beginning, I intended to create a language model that would take a prompt such as "Geschirrabwaschgesetz" (a law about washing dishes) and write a corresponding law text in German.

I was discouraged from training the original char-RNN by the scary amount of training time needed for 110 M of training data. Therefore I went with fine-tuning a German GPT-2 (and later the better one; thanks Jo!). The fine-tuning process for such a model is described here or here, for example.

(Un-)expected discovery

I happened to discover that my intended use case is covered almost perfectly by the LLAMA 2 Chat German model (almost, because of a few grammatical errors). This is very likely because it was fine-tuned with the German legal SQuAD dataset, among others.

I do not want to withhold the result from you (produced in LM Studio):

Just look at this beauty! It even defined "Hygiene" in the last subparagraph! And hence this series is concluded.

Build law text corpus

Wincent Balin — Thu, 03 Aug 2023 07:11:22 +0000

In this part of the series, I will describe how to create a corpus of German law texts from https://www.gesetze-im-internet.de.

Previously in series

In the previous parts of this series, we downloaded 6518 German laws, in XML format, stored in ZIP files.

Conversion to plain text

Converting XML documents to plain text can be accomplished with many tools and technologies, but after thorough consideration of a couple of edge cases I decided to use an XSLT stylesheet.

After studying the DTD file referenced in the XML files, as well as the XML files themselves, the following points had to be addressed (the paths given use XPath notation):

1. The XML files have the root element /dokumente
2. The laws are either incredibly short and consist of a single paragraph, or rather long with a table of contents
3. In the first case from 2., the law name is in metadaten/enbez and metadaten/titel (if the first path is present) or in metadaten/enbez only; in the second case from 2., the title is in norm/metadaten/langue
4. The text body is always in textdaten
5. Paragraphs are in P tags and end with a new line
6. Definition lists are in DL tags and are rendered like paragraphs, but without a new line after the last entry
7. A line break in the text is a BR tag, but it is not rendered within a table or a list entry
8. Tables of contents (TOC tags) are excluded, as they only repeat paragraph titles and are thus senseless for language model training; also, they are unusable in plain text, as there are no known page numbers
9. Titles (Title tags) are rendered with an appended new line
10. Tables (table tags) are rendered with each row (row tags) ending with a new line and every cell (entry tags) but the last one in a row followed by a tab character
11. The end marker of a law text is 25 empty lines
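To make the table rule above more tangible, here is a minimal sketch in Python rather than XSLT (a swapped-in technique for illustration only, not the stylesheet used in this post). It assumes the element names table, row and entry as listed above; the function name render_table and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def render_table(table):
    """Render a table element as text: tab between cells, new line after each row."""
    lines = []
    for row in table.iter("row"):
        # itertext() flattens any nested markup inside a cell into plain text.
        cells = ["".join(entry.itertext()).strip() for entry in row.iter("entry")]
        # Joining with tabs puts a tab after every cell except the last one.
        lines.append("\t".join(cells) + "\n")
    return "".join(lines)


# Made-up sample table, not taken from a real law.
sample = ET.fromstring(
    "<table>"
    "<row><entry>a</entry><entry>b</entry></row>"
    "<row><entry>c</entry><entry>d</entry></row>"
    "</table>"
)
rendered = render_table(sample)
print(repr(rendered))
```

The sketch does not handle tables nested inside table cells.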

And hence the short XSLT stylesheet of about 100 lines:

Run it on Windows with msxsl.exe as the XSLT processor like this:

msxsl BJNR001270871.xml giitotext.xsl > BJNR001270871.txt

Concatenating the text files creates a law text corpus.
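The concatenation can be done with any tool; as a minimal sketch in Python, it might look like the following. The function name build_corpus, the *.txt pattern and the UTF-8 encoding are assumptions, so adjust them to whatever the stylesheet actually outputs.

```python
from pathlib import Path


def build_corpus(text_dir, corpus_path):
    """Concatenate all *.txt files in text_dir into one corpus file.

    Returns the number of files that were appended.
    """
    text_dir = Path(text_dir)
    corpus_path = Path(corpus_path)
    count = 0
    with corpus_path.open("w", encoding="utf-8") as corpus:
        # Sort the files to get a reproducible order of laws in the corpus.
        for text_file in sorted(text_dir.glob("*.txt")):
            if text_file.resolve() == corpus_path.resolve():
                continue  # never read the corpus file into itself
            corpus.write(text_file.read_text(encoding="utf-8"))
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical directory and file names.
    print(build_corpus("texts", "corpus.txt"), "files concatenated")
```

Because each converted text already ends with the end marker of empty lines, no extra separator is written between the files.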

Next step

In the next part of the series, we will see how to train a language model with the text corpus we just created.

Fetch German laws

Wincent Balin — Sat, 26 Jun 2021 22:44:44 +0000

In this part of the series, I will describe how to fetch German law texts from https://www.gesetze-im-internet.de.

Four formats

The (federal) laws in Germany are published by the Federal Ministry of Justice and Consumer Protection on https://www.gesetze-im-internet.de. There are also land (i.e. state) laws, published here, administrative regulations, published here, and many more laws, but for the sake of simplicity we will use the texts of federal laws only.

As stated in the notes page, there are four formats available:

HTML (which you can view in browser)
PDF (most suitable for archive or for printed documents)
EPUB (for e-book readers)
XML (original format, which can be converted easily to other formats)

The format of the XML representation is defined by this DTD, which will become very helpful in the next part of this series.

As also stated on the notes page mentioned above, the index of XML documents is available at http://www.gesetze-im-internet.de/gii-toc.xml. This index links to the XML documents, packed into ZIP archives, all of them having the same name xml.zip.
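As a rough sketch, the links could be pulled out of such an index with xml.etree.ElementTree from the standard library. The structure used here (item elements with title and link children) is an assumption and has not been checked against the live file; the sample entries and the name archive_links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the index; the entries are invented.
SAMPLE_INDEX = """<items>
  <item>
    <title>Beispielgesetz A</title>
    <link>http://www.gesetze-im-internet.de/beispiel_a/xml.zip</link>
  </item>
  <item>
    <title>Beispielgesetz B</title>
    <link>http://www.gesetze-im-internet.de/beispiel_b/xml.zip</link>
  </item>
</items>"""


def archive_links(index_xml):
    """Return a list of (title, link) pairs, one per item element."""
    root = ET.fromstring(index_xml)
    pairs = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if link:
            pairs.append((title, link))
    return pairs


links = archive_links(SAMPLE_INDEX)
print(links)
```

For the real index, the XML text would come from urllib.request.urlopen instead of the sample string.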

The choice of the format

From the four available formats, we need the one that represents the resulting text with the least markup. The requirement comes from the need to generate future law texts with as little markup as possible.

This requirement, of course, eliminates the PDF format, because it is tailored to print media. While the HTML format could be converted to text, for example with the venerable html2text, the contents of law texts are split across small sections, which complicates the conversion. The conversion of the EPUB format to text is difficult to customise, at least in comparison to XML. Finally, for the XML format, there is already a converter to plain text, described in another post.

So we need the documents in XML format.

How to parse HTML with batteries included

Even before Beautiful Soup, it was possible to parse HTML data using the class HTMLParser from the module html.parser, documented here.

Also, even before requests, it was possible to fetch data over HTTP with the functions urlopen and urlretrieve from the module urllib.request, documented here and here.

Should you ask yourself at this point why I overlook two very nice and well-tried Python packages, please read the list under First things first in this article.

To parse HTML with the HTMLParser class, you simply create a subclass from it. Then, depending on what you need to get from HTML data, you implement the handle_* methods. For example, to parse links from the https://www.gesetze-im-internet.de front page, you need the following code:
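A minimal sketch of such a subclass, using only the standard library, could look like this. The class name LinkParser and the sample HTML are illustrative assumptions; for the real front page, the HTML text would come from urlopen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the targets of all <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


# Made-up HTML snippet standing in for the downloaded page.
sample_html = '<p><a href="aktuell.html">Gesetze</a> <a href="/hinweise.html">Hinweise</a></p>'
parser = LinkParser("https://www.gesetze-im-internet.de/")
parser.feed(sample_html)
print(parser.links)
```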

Collecting all XML documents

While, as mentioned above, there is a list of XML documents here, we will try to collect URLs of all XML documents from the list of current documents at http://www.gesetze-im-internet.de/aktuell.html.

The parser implemented for this page is similar to the previous example. As the current documents are grouped by the first character into separate lists, this parser collects the links to these lists:
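As a hedged sketch, such a parser might filter the links by a marker in the file name. The marker "Teilliste_" is an assumption about how the partial lists are named, and the class name PartialListParser and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = "http://www.gesetze-im-internet.de/aktuell.html"


class PartialListParser(HTMLParser):
    """Collect links to the per-letter document lists."""

    def __init__(self, base_url=PAGE_URL, marker="Teilliste_"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # assumed naming pattern of the partial lists
        self.partial_list_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep only the links that look like partial lists.
        if self.marker in href:
            self.partial_list_urls.append(urljoin(self.base_url, href))


# Made-up snippet standing in for the downloaded page.
sample_html = (
    '<a href="./Teilliste_A.html">A</a>'
    '<a href="./Teilliste_B.html">B</a>'
    '<a href="impressum.html">Impressum</a>'
)
list_parser = PartialListParser()
list_parser.feed(sample_html)
partial_list_urls = list_parser.partial_list_urls
print(partial_list_urls)
```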

As all links to document lists are stored in the variable partial_list_urls, we must add another parser to fetch the links to XML documents. This parser also stores law names.
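A sketch of this second parser is shown here under two loudly stated assumptions: every law is linked through a page ending in /index.html, and its archive xml.zip lives in the same directory. The class name LawLinkParser and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LawLinkParser(HTMLParser):
    """Collect (law name, xml.zip URL) pairs from one partial list page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.laws = []
        self._href = None  # set while inside a link to a law
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.endswith("/index.html"):  # assumed link pattern
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Collect the link text, including text of nested tags.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = " ".join("".join(self._text).split())
            # Assumption: xml.zip sits next to index.html.
            self.laws.append((name, urljoin(self._href, "xml.zip")))
            self._href = None


# Made-up snippet standing in for a partial list page.
sample_html = (
    '<p><a href="./beispiel/index.html"><abbr title="Beispielgesetz">BspG</abbr></a> '
    '<a href="./beispiel/BspG.pdf">PDF</a></p>'
)
law_parser = LawLinkParser("http://www.gesetze-im-internet.de/Teilliste_B.html")
law_parser.feed(sample_html)
print(law_parser.laws)
```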

Complete fetch code

If we combine the two examples, and add some error handling and some urlretrieve action as well, we get this:
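The download part can be sketched as follows; this is a simplified illustration, not the complete program. Since all archives share the name xml.zip, the sketch derives a unique file name from the directory in the URL. The names cache_name and fetch_all are hypothetical.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def cache_name(url):
    """Turn .../<law>/xml.zip into <law>.zip so that file names do not collide."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) < 2:
        raise ValueError("URL has no directory part: " + url)
    return parts[-2] + ".zip"


def fetch_all(urls, cache_dir="cache"):
    """Download every URL into cache_dir; return a list of (url, error) failures."""
    os.makedirs(cache_dir, exist_ok=True)
    failed = []
    for url in urls:
        target = os.path.join(cache_dir, cache_name(url))
        if os.path.exists(target):
            continue  # already fetched in an earlier run
        try:
            urlretrieve(url, target)
        except OSError as error:  # URLError and HTTPError are subclasses of OSError
            failed.append((url, str(error)))
    return failed
```

Calling fetch_all with the collected URLs returns the downloads that failed, so they can be retried later; files that are already present are skipped.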

After executing this code, we get 6518 ZIP files in the cache directory.

Next step

In the next step, we will build the text corpus from all the law texts fetched.

Stay tuned!

Generate German laws

Wincent Balin — Sat, 26 Jun 2021 20:08:52 +0000

In this series, I am going to describe how to build a generator of German laws.

First things first:

1. I am doing this for my own amusement.
2. Because of 1., I will not necessarily seek simple ways to do things.
3. You will most probably facepalm repeatedly reading the articles from this series.
4. Given enough time, you will learn to enjoy 3.

The goal

… consists of four parts:

Fetch German laws from https://www.gesetze-im-internet.de
Build text corpus from the downloaded laws
Train a char-RNN with the text corpus
Create an easy to use generator of German laws

The prototype for XML conversion was done in the previous article on this blog.

Next step

In the next post, we are going to create the code that fetches all the law texts.

Stay tuned!

Convert German laws from XML to text using XSLT

Wincent Balin — Sat, 26 Jun 2021 01:33:14 +0000

For a small project, I needed to convert German laws, found at https://www.gesetze-im-internet.de/, from XML format to text format.

The XML format is described here and is defined by this DTD file.

The source code in the following XSL file is pretty straightforward. Only adding newlines and indenting definition lists posed an additional challenge.

How to import large Plaso file into Timesketch in Docker

Wincent Balin — Thu, 12 Mar 2020 20:20:22 +0000

Sometimes Timesketch, when run in Docker, hiccups when importing a Plaso file that is too large, as in issue #1060. You can still upload the file using this shell script:

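#!/bin/sh
#
# Run this script with timesketch_import_plaso.sh plaso_file [timesketch_container]

# Require at least the Plaso file argument
if [ $# -eq 0 ]
then
    echo "Run this script with $0 plaso_file [timesketch_container]"
    exit 1
fi

# Path of the copied file inside the container
DOCKER_PATH="/tmp/$(basename "$1")"
# Timeline name: the given path without its last extension
TIMELINE="$(echo "$1" | sed -e 's/\.[^.]*$//')"
# Default container name, overridden by the optional second argument
CONTAINER=docker_timesketch_1
if [ -n "$2" ]
then
    CONTAINER=$2
fi

# Copy the file into the container, import it with psort.py, then clean up
docker cp "$1" "$CONTAINER:/tmp"
docker exec -it "$CONTAINER" psort.py -o timesketch --name "$TIMELINE" "$DOCKER_PATH"
docker exec -it "$CONTAINER" rm "$DOCKER_PATH"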

Download links for Microsoft Windows Services for UNIX 3.5 and SUA

Wincent Balin — Fri, 04 Oct 2019 21:54:16 +0000

Should you want to use Microsoft Windows Services for UNIX (SFU) within Windows XP or Windows Server 2003, you need SFU 3.5, which you will currently (October 2019) find either in the Internet Archive as an ISO image or at Microsoft as setup executables.

Addendum:

The Subsystem for UNIX-Based Applications (SUA), which works with Windows 7, is also available at Microsoft, as well as SUA for Windows Vista.

If you want to run Docker on local Linux box

Wincent Balin — Fri, 04 Oct 2019 11:47:16 +0000

If you would like to run Docker on a Linux box in your LAN, and you already configured the Linux box hostname as computer1 and a user account there as me, and your current Docker environment is Docker Toolbox on Windows together with docker-machine, perform the following steps:

1. Add user me to the group sudo on your future Docker host: usermod -a -G sudo me
2. Remove the password prompt when running sudo (as described here): replace %sudo ALL=(ALL) ALL with %sudo ALL=(ALL) NOPASSWD: ALL
3. Run this command in your current Docker environment to install Docker on your future Docker host: docker-machine create --driver generic --generic-ip-address computer1 --generic-engine-port 2375 --generic-ssh-user me computer1. The last part is the name of the configuration in your current Docker environment.
4. Activate the configuration in your current Docker environment: eval $("C:\Program Files\Docker Toolbox\docker-machine.exe" env computer1)
5. Reverse step 2 and, if needed, also step 1

Then you are ready to use Docker on your computer1 box.

Perform step 4 to activate this configuration again.

Starting with Intel Galileo

Wincent Balin — Tue, 01 Oct 2019 15:59:59 +0000

Intel Galileo is an Intel Pentium-based platform, which is supported by the Arduino IDE. It runs Linux, and it is possible to install a PCI-Express card (most often an Intel Centrino Wi-Fi board gets installed).

It is also one of the Arduino-certified platforms. The configuration in the Arduino IDE consists of installing the Intel i586 boards platform in the board manager. The firmware is uploaded through the USB CLIENT port as usual.

If you want to communicate with Linux on Galileo directly, you need the cable adapter; if you would like to solder one yourself, the pinout is available here. Connect to the board using the serial port at 115200 bps, and voilà!

Office in Vagrant VM

Wincent Balin — Tue, 16 Jul 2019 16:48:25 +0000

I wanted to use office software in a VM, while being able to edit files on the host machine. Usually, people create a VM in VirtualBox and map the host directory into this VM using shared folders. But, because it is a long process, I decided to automate it using Vagrant.

TL;DR

Go to this GitHub repository
Download Vagrantfile and place it into your documents directory
Open your favourite CLI and change into that directory
Run vagrant up to configure the VM and wait for a while
Run vagrant rdp and use vagrant as login and as password
Open LibreOffice and configure your personal data, so the documents with data fields can set them to appropriate values
Edit your documents on the host system!

Design decisions

The Vagrantfile must be placed into the directory with the documents you want to edit. Of course, I could add the default documents directory of the host OS, but I decided against it: first, it would create an additional maintenance burden (especially if the syntax for default OS paths changes), and second, it would not work on systems where the documents directory was moved to another location. So, for now, the entry for the synchronised directory is

config.vm.synced_folder ".", "/home/vagrant/Documents", mount_options: ["dmode=775,fmode=664"]

The VM created by Vagrant is based on Ubuntu Bionic x64. The Vagrantfile installs the packages xubuntu-desktop and libreoffice. Then it also enables Remote Desktop connexions by installing xrdp, by starting it automatically as a service and by enabling port 3389. The forwarding to the port is configured with

config.vm.network "forwarded_port", guest: 3389, host: 33389, protocol: "tcp", auto_correct: true

The automatic start of XFCE session is enabled by adding xfce4-session to the file .xsession. Everything is run as the default Vagrant user vagrant.

Run IDLE (Python IDE) in virtual environment

Wincent Balin — Fri, 28 Jun 2019 11:41:08 +0000

Imagine: you are running software implemented in Python and there is a problem you would like to debug or edit away. The software resides in a virtual environment and apart from this virtual environment and a standard Python installation nothing else is installed (or is not permitted to be installed). What should you do?

You can run IDLE within the activated virtual environment with this command:

python -m idlelib.idle

This command opens the starting window of IDLE with Python prompt. From there you can open the file you would like to edit.

But what if you would like to open the Python file right away? Then use this command:

python -m idlelib.idle filename

While IDLE does not have all the niceties of PyCharm, it is better than Notepad.exe, is almost always installed and has debugging capabilities. You might even enjoy it.

Source: Run IDLE from a batch file

Python shebang

Wincent Balin — Sat, 15 Jun 2019 11:08:02 +0000

Currently, in the interregnum where Python 2 and Python 3 may co-exist on the same system, the PEP 394 recommendations for the shebang line in Python programs are, in short:

Use #!/usr/bin/env before Python interpreter
Use #!/usr/bin/env python only for programs that work with both Python 2 and Python 3
If your program runs with Python 3 only, replace python in the previous line with python3
Else, for Python 2-based programs, replace python with python2
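For illustration, a script that runs with Python 3 only would start like this; the script body and the function name interpreter_major are made up for the example.

```python
#!/usr/bin/env python3
"""Example script that runs with Python 3 only, hence python3 in the shebang."""
import sys


def interpreter_major():
    """Return the major version of the interpreter running this script."""
    return sys.version_info.major


if __name__ == "__main__":
    print("Running on Python", interpreter_major())
```

The shebang only takes effect when the file is executable and started directly, for example as ./script.py on a Unix-like system.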