<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pinei</title>
    <description>The latest articles on Forem by Pinei (@pinei).</description>
    <link>https://forem.com/pinei</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F350303%2Fab1fdfc0-9306-495d-8b48-6bad0d33bceb.jpeg</url>
      <title>Forem: Pinei</title>
      <link>https://forem.com/pinei</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pinei"/>
    <language>en</language>
    <item>
      <title>Using "skills" and a spec-driven approach with Claude Code</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Wed, 01 Apr 2026 18:17:43 +0000</pubDate>
      <link>https://forem.com/pinei/uso-de-skills-e-abordagem-spec-driven-com-claude-code-4j3m</link>
      <guid>https://forem.com/pinei/uso-de-skills-e-abordagem-spec-driven-com-claude-code-4j3m</guid>
      <description>&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; is an agentic coding environment: it does not just answer questions, it can &lt;strong&gt;read the codebase, edit files, run commands, and verify results&lt;/strong&gt; in a working loop that combines understanding, action, and validation. It can also be used in the terminal, the IDE, the desktop app, and the browser, with a single set of instructions and extensions shared across those environments.&lt;/p&gt;

&lt;p&gt;Within this model, two ideas stand out for gaining consistency and scale: &lt;strong&gt;skills&lt;/strong&gt; and a way of working driven by &lt;strong&gt;specifications&lt;/strong&gt; (&lt;em&gt;spec-driven&lt;/em&gt;). &lt;em&gt;Skills&lt;/em&gt; are Markdown extensions that add knowledge, reusable workflows, and invocable commands to Claude Code; &lt;em&gt;spec-driven&lt;/em&gt;, on the other hand, is not an "official feature" under that name, but a &lt;strong&gt;usage strategy&lt;/strong&gt; fully aligned with the practices Anthropic recommends: start from a clear specification, decompose the work, define verification criteria, and keep structured artifacts throughout execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  What &lt;em&gt;skills&lt;/em&gt; are in Claude Code
&lt;/h2&gt;

&lt;p&gt;In Claude Code, a &lt;em&gt;skill&lt;/em&gt; is a modular capability defined in a &lt;code&gt;SKILL.md&lt;/code&gt; file. That file can contain &lt;strong&gt;YAML frontmatter&lt;/strong&gt; with metadata, such as the name, description, allowed tools, and invocation mode, followed by Markdown instructions that guide the agent's behavior when the &lt;em&gt;skill&lt;/em&gt; is triggered. Claude can load a &lt;em&gt;skill&lt;/em&gt; automatically when it looks relevant to the user's request, or the user can invoke it explicitly through a command such as &lt;code&gt;/skill-name&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Skills&lt;/em&gt; matter because they act as a layer between the model's "generic knowledge" and the specific needs of your team or project. Instead of repeating the same prompt every time, for example "review this code for security, performance, and compatibility", you embed that reasoning in a reusable &lt;em&gt;skill&lt;/em&gt;. This reduces ambiguity, improves repeatability, and better separates what is a &lt;strong&gt;project rule&lt;/strong&gt; from what is a &lt;strong&gt;one-off task&lt;/strong&gt;. The Claude Code documentation itself positions &lt;em&gt;skills&lt;/em&gt; as the most flexible extension in the ecosystem, alongside &lt;code&gt;CLAUDE.md&lt;/code&gt;, hooks, subagents, and MCP.&lt;/p&gt;

&lt;p&gt;Another strength is fine-grained behavior control. A &lt;em&gt;skill&lt;/em&gt; can, for instance, be marked with &lt;code&gt;disable-model-invocation: true&lt;/code&gt; so that &lt;strong&gt;only the user&lt;/strong&gt; can run it manually, which is useful for sensitive actions such as a &lt;em&gt;deploy&lt;/em&gt; or creating a commit. It can also use &lt;code&gt;allowed-tools&lt;/code&gt; to restrict which tools are available during its execution, and &lt;code&gt;context: fork&lt;/code&gt; to run in an &lt;strong&gt;isolated subagent&lt;/strong&gt;, preserving the main context. This design makes working with Claude more governable and more predictable.&lt;/p&gt;
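&lt;p&gt;As a minimal sketch, frontmatter combining these controls could look like the following (the skill name and description are hypothetical):&lt;/p&gt;

```yaml
---
name: deploy-staging
description: Runs the staging deploy checklist and then the deploy script.
disable-model-invocation: true
allowed-tools: Read, Bash
context: fork
---
```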

&lt;h2&gt;
  
  
  The role of &lt;code&gt;CLAUDE.md&lt;/code&gt; and the &lt;code&gt;.claude/&lt;/code&gt; structure
&lt;/h2&gt;

&lt;p&gt;Before even talking about &lt;em&gt;spec-driven&lt;/em&gt;, it is worth understanding the architecture Claude Code recommends. The &lt;code&gt;CLAUDE.md&lt;/code&gt; file is loaded in every session and is meant to hold the project's &lt;strong&gt;long-lived conventions&lt;/strong&gt;: build/test/lint commands, the main stack, export patterns, API formats, and other "always valid" rules. The &lt;code&gt;.claude/&lt;/code&gt; folder, in turn, can store &lt;em&gt;skills&lt;/em&gt;, per-path rules, subagents, settings, and additional agent memory. The explicit recommendation is: use &lt;code&gt;CLAUDE.md&lt;/code&gt; for what always applies; move what is specific to certain tasks into &lt;em&gt;skills&lt;/em&gt; or more localized rules.&lt;/p&gt;

&lt;p&gt;This separation is crucial for keeping the context under control. Anthropic itself highlights that the &lt;strong&gt;context window&lt;/strong&gt; is a scarce resource: as the window fills up, performance can degrade, and the agent may start losing instructions or making more mistakes. That is why a well-structured Claude Code project tends to use &lt;code&gt;CLAUDE.md&lt;/code&gt; as the base layer, &lt;em&gt;skills&lt;/em&gt; as on-demand capabilities, and clear verification criteria for each task.&lt;/p&gt;
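&lt;p&gt;As an illustration only (the commands and stack are invented), a lean &lt;code&gt;CLAUDE.md&lt;/code&gt; base layer might look like this:&lt;/p&gt;

```markdown
# Project conventions

- Stack: TypeScript + Node 20, tests with Vitest
- Build: npm run build
- Test: npm test
- Lint: npm run lint

## Rules that always apply
- Never commit directly to main; open a PR.
- Every API change must update docs/api.md.
```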

&lt;h2&gt;
  
  
  What it means to work &lt;em&gt;spec-driven&lt;/em&gt; with Claude Code
&lt;/h2&gt;

&lt;p&gt;In the context of Claude Code, "&lt;em&gt;spec-driven&lt;/em&gt;" should be understood less as an isolated feature and more as an &lt;strong&gt;engineering discipline&lt;/strong&gt;. Instead of asking "just implement this", you start from an &lt;strong&gt;explicit specification&lt;/strong&gt;: goal, requirements, acceptance criteria, technical constraints, validation commands, affected files, and, if possible, examples or expected tests. This approach connects directly with the agentic loop Anthropic describes (gather context, act, verify) because it gives the agent a clear compass for deciding what to read, what to change, and how to prove the work is done.&lt;/p&gt;

&lt;p&gt;Anthropic stresses in its best practices that the strongest lever for quality is &lt;strong&gt;giving Claude a way to verify its own work&lt;/strong&gt;, including tests, expected outputs, and objective success criteria. In other words, a good specification for Claude Code describes not only "what to build" but also "how to know it is right". That changes your relationship with the agent: instead of relying on manual inspection at every step, you turn the task into something more verifiable and iterable.&lt;/p&gt;

&lt;p&gt;The same logic appears in Anthropic's writing on long-running agents. For more complex work, they report gains from &lt;strong&gt;decomposing a product specification into a task list&lt;/strong&gt; and from using &lt;strong&gt;structured artifacts&lt;/strong&gt; to hand context off between sessions. In other words, "&lt;em&gt;spec-driven&lt;/em&gt;" is, in practice, the use of intermediate documents and structures so the agent does not try to do everything at once, lose coherence, or finish too early.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why combine &lt;em&gt;skills&lt;/em&gt; with &lt;em&gt;spec-driven&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;The combination is powerful because each piece solves a different problem. &lt;em&gt;Spec-driven&lt;/em&gt; organizes the &lt;strong&gt;current work&lt;/strong&gt;: what must be done now, which criteria define success, and which constraints must be respected. &lt;em&gt;Skills&lt;/em&gt;, in turn, organize &lt;strong&gt;recurring knowledge and method&lt;/strong&gt;: how your team usually breaks requirements into tasks, how to validate API changes, how to review security, how to prepare a PR, or how to run a safe deploy.&lt;/p&gt;

&lt;p&gt;In practice, this means the specification becomes the &lt;strong&gt;source of truth for that delivery&lt;/strong&gt;, while the &lt;em&gt;skills&lt;/em&gt; become the &lt;strong&gt;standard operating procedures&lt;/strong&gt; Claude applies over and over. This separation avoids two extremes: a bloated &lt;code&gt;CLAUDE.md&lt;/code&gt; with too many instructions, and giant prompts rewritten for every request. It also improves the agent's ability to work in parallel or in subagents, because each unit of work can load an appropriate &lt;em&gt;skill&lt;/em&gt; without polluting the main context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended structure for a project
&lt;/h2&gt;

&lt;p&gt;A mature way to organize this is to keep three clearly distinct layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meu-projeto/
├─ CLAUDE.md
├─ docs/
│  └─ specs/
│     ├─ auth-refresh-token.md
│     └─ billing-retry-policy.md
└─ .claude/
   ├─ skills/
   │  ├─ implement-from-spec/
   │  │  └─ SKILL.md
   │  ├─ review-against-spec/
   │  │  └─ SKILL.md
   │  └─ open-pr-with-checklist/
   │     └─ SKILL.md
   ├─ rules/
   │  └─ testing.md
   └─ settings.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this layout, &lt;code&gt;CLAUDE.md&lt;/code&gt; holds permanent conventions; &lt;code&gt;docs/specs/&lt;/code&gt; concentrates the per-feature specifications; and &lt;code&gt;.claude/skills/&lt;/code&gt; encapsulates reusable workflows. This follows Claude Code's logic of separating &lt;strong&gt;persistent context&lt;/strong&gt;, &lt;strong&gt;on-demand knowledge&lt;/strong&gt;, and &lt;strong&gt;agent customizations&lt;/strong&gt; inside the &lt;code&gt;.claude/&lt;/code&gt; folder.&lt;/p&gt;
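&lt;p&gt;A spec under &lt;code&gt;docs/specs/&lt;/code&gt; can stay short. As a sketch (the feature and validation command are invented for illustration):&lt;/p&gt;

```markdown
# Spec: refresh-token rotation

## Goal
Rotate refresh tokens on every use, invalidating the previous one.

## Out of scope
Changes to the access-token format.

## Acceptance criteria
- A reused (already rotated) refresh token returns 401.
- Existing sessions are not logged out by the deploy.

## Validation
- npm test -- auth/refresh
```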

&lt;h2&gt;
  
  
  Example &lt;em&gt;skill&lt;/em&gt; for implementing from a specification
&lt;/h2&gt;

&lt;p&gt;Below is an original example of a &lt;em&gt;skill&lt;/em&gt; you could use to implement a feature from a spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;implement-from-spec&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Implementa uma funcionalidade a partir de uma especificação em docs/specs e valida com testes.&lt;/span&gt;
&lt;span class="na"&gt;disable-model-invocation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Read, Grep, Glob, Edit, Write, Bash&lt;/span&gt;
&lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fork&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="s"&gt;Você vai implementar a funcionalidade descrita em $ARGUMENTS.&lt;/span&gt;

&lt;span class="na"&gt;Passos obrigatórios&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;1. Ler a especificação informada.&lt;/span&gt;
&lt;span class="s"&gt;2. Identificar requisitos funcionais e não funcionais.&lt;/span&gt;
&lt;span class="s"&gt;3. Mapear arquivos impactados.&lt;/span&gt;
&lt;span class="s"&gt;4. Propor plano curto antes de editar.&lt;/span&gt;
&lt;span class="s"&gt;5. Implementar em pequenos incrementos.&lt;/span&gt;
&lt;span class="s"&gt;6. Executar testes e validações relevantes.&lt;/span&gt;
&lt;span class="s"&gt;7. Comparar resultado final com os critérios de aceite da spec.&lt;/span&gt;
&lt;span class="na"&gt;8. Gerar resumo com&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;requisitos atendidos&lt;/span&gt;
   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pendências&lt;/span&gt;
   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;riscos&lt;/span&gt;
   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;testes executados&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;em&gt;skill&lt;/em&gt; captures the &lt;em&gt;spec-driven&lt;/em&gt; spirit well: it requires reading the specification, turns requirements into an action plan, implements incrementally, and closes with verification against acceptance criteria. The use of &lt;code&gt;context: fork&lt;/code&gt; also follows the recommended pattern for more structured tasks that benefit from an isolated subagent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example operational workflow
&lt;/h2&gt;

&lt;p&gt;A simple workflow might be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Record the project conventions&lt;/strong&gt; in &lt;code&gt;CLAUDE.md&lt;/code&gt; (build, test, lint, stack, PR standards).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create the spec&lt;/strong&gt; in &lt;code&gt;docs/specs/feature-x.md&lt;/code&gt;, including the problem, scope, out of scope, acceptance criteria, and expected validation. This part is an engineering practice, but it fits directly into the recommendation to give Claude clear verification criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a skill&lt;/strong&gt; such as &lt;code&gt;/implement-from-spec docs/specs/feature-x.md&lt;/code&gt;. Since &lt;em&gt;skills&lt;/em&gt; can be invoked via slash command and even executed in subagents, they work as a repeatable "operating mode".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a second, review-oriented skill&lt;/strong&gt;, for example &lt;code&gt;/review-against-spec docs/specs/feature-x.md&lt;/code&gt;, to compare the final implementation against the spec and surface gaps. This reinforces the principle of verification and closure against objective criteria.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Real benefits of this approach
&lt;/h2&gt;

&lt;p&gt;The first benefit is &lt;strong&gt;consistency&lt;/strong&gt;. By encapsulating procedures in &lt;em&gt;skills&lt;/em&gt;, you reduce variation across sessions and across engineers. The second is &lt;strong&gt;governance&lt;/strong&gt;: you can control who may trigger certain actions, which tools are available, and when a &lt;em&gt;skill&lt;/em&gt; should run in an isolated context. The third is &lt;strong&gt;context efficiency&lt;/strong&gt;, since Claude does not need to carry detailed instructions all the time; it only needs to know the &lt;em&gt;skill&lt;/em&gt; exists and load it when necessary.&lt;/p&gt;

&lt;p&gt;The fourth benefit is &lt;strong&gt;scalability on long tasks&lt;/strong&gt;. Anthropic's publications on long-running agents show that decomposing the work, using structured artifacts, and keeping clear handoffs between steps greatly improve coherence in long sessions. That is exactly the kind of gain &lt;em&gt;spec-driven&lt;/em&gt; delivers when combined with well-defined &lt;em&gt;skills&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limits and caveats
&lt;/h2&gt;

&lt;p&gt;Even with a good architecture, Claude Code is still a probabilistic agent. Anthropic itself points out that excessive context degrades performance and that vague tasks without validation criteria tend to produce outputs that look good but are incorrect. For that reason, &lt;em&gt;skills&lt;/em&gt; should not become dumps of giant instructions, and specs should not be verbose documents without tests or objective acceptance conditions.&lt;/p&gt;

&lt;p&gt;Another caveat is not to turn everything into a &lt;em&gt;skill&lt;/em&gt;. The documentation clearly differentiates the roles: &lt;code&gt;CLAUDE.md&lt;/code&gt; for persistent conventions; &lt;em&gt;skills&lt;/em&gt; for workflows and reusable knowledge; hooks for deterministic automation; MCP for external integrations; subagents for isolation and parallelism. Choosing the wrong abstraction creates more complexity, not more productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using &lt;strong&gt;Claude Code&lt;/strong&gt; with &lt;strong&gt;skills&lt;/strong&gt; and a &lt;strong&gt;spec-driven&lt;/strong&gt; approach is, in practice, turning a code assistant into a &lt;strong&gt;more reliable working system&lt;/strong&gt;. &lt;em&gt;Skills&lt;/em&gt; give your method a reusable shape; specs give execution a concrete direction; &lt;code&gt;CLAUDE.md&lt;/code&gt; keeps the baseline aligned; and the subagent, tool, and validation mechanisms complete the cycle. The result is less improvisation, more reproducibility, and a greater ability to use the agent on real software engineering tasks.&lt;/p&gt;

</description>
      <category>ia</category>
      <category>dev</category>
      <category>code</category>
      <category>claude</category>
    </item>
    <item>
      <title>Multi-Selection Like VSCode in Notepad++</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Fri, 30 May 2025 01:42:48 +0000</pubDate>
      <link>https://forem.com/pinei/multi-selection-like-vscode-in-notepad-218m</link>
      <guid>https://forem.com/pinei/multi-selection-like-vscode-in-notepad-218m</guid>
      <description>&lt;p&gt;Many developers love the multi-selection feature in Visual Studio Code (VSCode), where pressing &lt;code&gt;Ctrl+D&lt;/code&gt; selects the next occurrence of the currently selected text—allowing you to edit multiple places at once. &lt;/p&gt;

&lt;p&gt;Notepad++ does not offer this natively, but with the powerful NppExec plugin and some configuration, you can achieve a very similar workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Multi-Selection Works in VSCode
&lt;/h3&gt;

&lt;p&gt;In VSCode, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select a word or text fragment.&lt;/li&gt;
&lt;li&gt;Press Ctrl+D to select the next occurrence of that text.&lt;/li&gt;
&lt;li&gt;Continue pressing Ctrl+D to add more occurrences to your selection.&lt;/li&gt;
&lt;li&gt;Edit all selected instances simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This feature is invaluable for refactoring, quick edits, or making the same change in multiple places.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Replicate Multi-Selection in Notepad++
&lt;/h3&gt;

&lt;p&gt;Notepad++ does not have this feature out-of-the-box, but you can configure it using the NppExec plugin and the Shortcut Mapper. Here’s how:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: Install NppExec Plugin

&lt;ul&gt;
&lt;li&gt;Open Notepad++.&lt;/li&gt;
&lt;li&gt;Go to Plugins &amp;gt; Plugins Admin.&lt;/li&gt;
&lt;li&gt;Find and install NppExec.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;If you encounter installation issues, you may need admin rights or to manually copy the plugin files to the Notepad++ plugins directory.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 2: Create the Multi-Selection Script

&lt;ul&gt;
&lt;li&gt;Go to Plugins &amp;gt; NppExec &amp;gt; Execute....&lt;/li&gt;
&lt;li&gt;In the command window, paste the following script:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npp_console 0
sci_sendmsg SCI_SETSEARCHFLAGS SCFIND_MATCHCASE
sci_sendmsg 2690  // SCI_TARGETWHOLEDOCUMENT
sci_sendmsg 2688  // SCI_MULTIPLESELECTADDNEXT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;npp_console 0&lt;/code&gt; hides the NppExec console when running the script.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sci_sendmsg SCI_SETSEARCHFLAGS SCFIND_MATCHCASE&lt;/code&gt; sets the search to be case sensitive.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sci_sendmsg 2690&lt;/code&gt; (&lt;code&gt;SCI_TARGETWHOLEDOCUMENT&lt;/code&gt;) sets the search target to the whole document, and &lt;code&gt;sci_sendmsg 2688&lt;/code&gt; (&lt;code&gt;SCI_MULTIPLESELECTADDNEXT&lt;/code&gt;) adds the next occurrence of the selected text to the current multi-selection.&lt;/p&gt;

&lt;p&gt;Click Save..., give your script a name (e.g., &lt;code&gt;MultiSelectAddNext&lt;/code&gt;), and save.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Step 3: Add the Script to the Menu and Assign a Shortcut&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to Plugins &amp;gt; NppExec &amp;gt; Advanced Options.&lt;/li&gt;
&lt;li&gt;Under "Associated Scripts", select your script and click "Add/Modify".&lt;/li&gt;
&lt;li&gt;(Optional) Check "Place to the Macros submenu" if you want quick access from the menu.&lt;/li&gt;
&lt;li&gt;Click OK and restart Notepad++ if prompted.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Step 4: Assign a Shortcut Key&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to Settings &amp;gt; Shortcut Mapper.&lt;/li&gt;
&lt;li&gt;Go to the "Plugin Commands" tab.&lt;/li&gt;
&lt;li&gt;Find your script (e.g., &lt;code&gt;MultiSelectAddNext&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Double-click the "Shortcut" column and assign your preferred shortcut (e.g., Ctrl+D).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;If Ctrl+D is already used, you may need to assign a different combination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Select the text you want to find more occurrences of.&lt;/li&gt;
&lt;li&gt;Press your assigned shortcut (e.g., Ctrl+D) to select the next occurrence.&lt;/li&gt;
&lt;li&gt;Repeat to select further occurrences.&lt;/li&gt;
&lt;li&gt;Edit all selected instances at once—just like in VSCode.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Troubleshooting and Tips
&lt;/h3&gt;

&lt;p&gt;If the shortcut opens the NppExec console, ensure &lt;code&gt;npp_console 0&lt;/code&gt; is at the top of your script.&lt;/p&gt;

&lt;p&gt;If your script does not appear in the Shortcut Mapper, make sure you added it via Advanced Options in NppExec.&lt;/p&gt;

&lt;p&gt;You can customize the search flags (e.g., for case sensitivity or whole-word matching) by changing the value passed to &lt;code&gt;SCI_SETSEARCHFLAGS&lt;/code&gt;.&lt;/p&gt;
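&lt;p&gt;For example, Scintilla defines &lt;code&gt;SCFIND_MATCHCASE&lt;/code&gt; as 4 and &lt;code&gt;SCFIND_WHOLEWORD&lt;/code&gt; as 2, so a variant of the script that also requires whole-word matches could pass the combined numeric value (this assumes your NppExec version accepts a numeric flag here; test it in the Execute window first):&lt;/p&gt;

```
npp_console 0
sci_sendmsg SCI_SETSEARCHFLAGS 6  // SCFIND_MATCHCASE (4) + SCFIND_WHOLEWORD (2)
sci_sendmsg 2690  // SCI_TARGETWHOLEDOCUMENT
sci_sendmsg 2688  // SCI_MULTIPLESELECTADDNEXT
```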

</description>
    </item>
    <item>
      <title>Removing unwanted files from a Git repository</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Wed, 02 Apr 2025 22:57:21 +0000</pubDate>
      <link>https://forem.com/pinei/removendo-arquivos-indesejaveis-em-um-repositorio-git-386g</link>
      <guid>https://forem.com/pinei/removendo-arquivos-indesejaveis-em-um-repositorio-git-386g</guid>
      <description>&lt;p&gt;It is not uncommon to need to remove unwanted files that ended up in some &lt;code&gt;commit&lt;/code&gt; in your Git repository. It may be an executable polluting your history, or data or configuration files holding sensitive information.&lt;/p&gt;

&lt;p&gt;Until recently this kind of situation was hard to deal with. Now we have modern, efficient tools to save us.&lt;/p&gt;

&lt;p&gt;One of them is &lt;code&gt;git-filter-repo&lt;/code&gt;, a command-line tool that rebuilds the repository based on filters.&lt;/p&gt;

&lt;p&gt;To install it we can use &lt;code&gt;pip&lt;/code&gt;, the Python package manager that is increasingly present in development environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install git-filter-repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;git-filter-repo&lt;/code&gt; works on a clean repository, preferably a fresh clone. Using it in your working folder is risky because it may perform a &lt;code&gt;reset&lt;/code&gt; without ceremony, and important working files will be lost.&lt;/p&gt;

&lt;p&gt;Make a clone or a clean copy of the repository.&lt;/p&gt;

&lt;p&gt;In the clone folder, assuming you want to remove all &lt;code&gt;.exe&lt;/code&gt; files, run the &lt;code&gt;git filter-repo&lt;/code&gt; command with the filter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git filter-repo --path-glob '*.exe' --invert-paths
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool will do the "dirty work" and leave the repository spotless. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;git filter-repo&lt;/code&gt; will analyze the entire history and remove references to the specified file(s). It also performs automatic cleanup (like &lt;code&gt;git gc&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NOTICE: Removing 'origin' remote; see 'Why is my origin removed?'
        in the manual if you want to push back there.
        (was ...)
Parsed 43 commits
New history written in 0.43 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a rule to &lt;code&gt;.gitignore&lt;/code&gt; to prevent future commits.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create or edit the .gitignore file at the root of your repository and add a &lt;code&gt;*.exe&lt;/code&gt; line to ignore those files.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add .gitignore
git commit -m "Add .exe to gitignore"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push the changes to the remote repository. Since you rewrote history, a normal &lt;code&gt;git push&lt;/code&gt; will most likely be rejected. You will need to force the push.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Force-pushing (&lt;code&gt;--force&lt;/code&gt; or &lt;code&gt;--force-with-lease&lt;/code&gt;) overwrites the history in the remote repository. Warn all collaborators before doing this, because they will need to update their local clones in a special way (usually with &lt;code&gt;fetch&lt;/code&gt; and &lt;code&gt;reset&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push origin nome-da-sua-branch --force
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that is it.&lt;/p&gt;

&lt;p&gt;Always remember backups and communication with the team when rewriting history.&lt;/p&gt;

</description>
      <category>git</category>
      <category>commit</category>
      <category>repo</category>
      <category>filter</category>
    </item>
    <item>
      <title>Reformatting dictionaries in Python</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Wed, 12 Feb 2025 23:28:24 +0000</pubDate>
      <link>https://forem.com/pinei/reformatando-dicionarios-em-python-58dc</link>
      <guid>https://forem.com/pinei/reformatando-dicionarios-em-python-58dc</guid>
      <description>&lt;p&gt;Today we'll explore a common problem when working with lists of dictionaries in Python: the need to filter those dictionaries based on a specific set of keys. We'll present two concise and efficient solutions, utilizing powerful features of the language.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Imagine you have a list of dictionaries, where each dictionary represents a set of parameters. In some situations, you need to extract only the attributes (key-value pairs) that correspond to a subset of keys.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2020-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2020-02&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2020-02&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2020-03&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;filtered_parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;filter_dictionaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Expected output:
# [{'start': '2020-01', 'end': '2020-02'},
#  {'start': '2020-02', 'end': '2020-03'}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our goal is to get a new list of dictionaries, containing only the "start" and "end" attributes from each original dictionary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution 1: Traditional iteration with if
&lt;/h3&gt;

&lt;p&gt;The first approach uses a for loop to iterate over the list of dictionaries and another nested for loop to iterate over the desired keys. Inside the inner loop, we check if the key exists in the current dictionary and, if so, add the key-value pair to the new dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter_dictionaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dictionary&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;new_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dictionary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dictionary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Solution 2: List comprehension to the rescue
&lt;/h3&gt;

&lt;p&gt;Python offers powerful comprehension syntax, which allows you to build lists and dictionaries concisely and expressively. By nesting a dictionary comprehension inside a list comprehension, we can implement the dictionary filtering in virtually a single expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter_dictionaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dictionary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dictionary&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dictionary&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is more compact and, for many developers, more readable than the version with nested loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparing approaches
&lt;/h3&gt;

&lt;p&gt;Both solutions are efficient and produce the same result. The choice between them is a matter of personal preference. Some may find the list comprehension version more elegant and concise, while others may prefer the clarity of traditional iteration.&lt;/p&gt;

&lt;p&gt;Tip: When transforming lists and dictionaries, favor comprehensions to write cleaner and more concise code.&lt;/p&gt;
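As a quick equivalence check, here is a self-contained sketch with both versions side by side (the sample `parameters` list is an assumption, mirroring the article's data):

```python
def filter_loop(parameters, keys):
    # Solution 1: traditional nested iteration
    result = []
    for dictionary in parameters:
        new_dict = {}
        for key in keys:
            if key in dictionary:
                new_dict[key] = dictionary[key]
        result.append(new_dict)
    return result


def filter_comprehension(parameters, keys):
    # Solution 2: a dict comprehension nested in a list comprehension
    return [
        {key: dictionary[key] for key in keys if key in dictionary}
        for dictionary in parameters
    ]


# Sample data (assumed for illustration)
parameters = [
    {'start': '2020-01', 'end': '2020-02', 'rows': 100},
    {'start': '2020-02', 'end': '2020-03', 'rows': 150},
]
keys = ['start', 'end']

# Both implementations produce the same filtered result
assert filter_loop(parameters, keys) == filter_comprehension(parameters, keys)
print(filter_comprehension(parameters, keys))
# [{'start': '2020-01', 'end': '2020-02'}, {'start': '2020-02', 'end': '2020-03'}]
```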

</description>
      <category>python</category>
    </item>
    <item>
      <title>DeepSeek, Efficiency, and Big Tech’s Response</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Tue, 04 Feb 2025 04:40:21 +0000</pubDate>
      <link>https://forem.com/pinei/deepseek-efficiency-and-big-techs-response-3ga9</link>
      <guid>https://forem.com/pinei/deepseek-efficiency-and-big-techs-response-3ga9</guid>
      <description>&lt;p&gt;DeepSeek, a rising star in the Chinese AI landscape, has quickly become one of the most downloaded AI apps, sparking discussions about the future of AI investment and resource allocation. What sets DeepSeek apart is its ability to develop a competitive AI model at a fraction of the cost of dominant U.S. models, which often require hundreds of billions of dollars in investment. While the exact cost of DeepSeek’s development is debated — likely higher than the reported $6 million — it remains a testament to the efficiency and innovation achievable under constraints. This efficiency is particularly evident in DeepSeek’s latest model, R1, which many see as a direct response to OpenAI’s o1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/deepseek-ai/DeepSeek-R1" rel="noopener noreferrer"&gt;deepseek-ai/DeepSeek-R1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The development of DeepSeek underscores how limitations can drive technological breakthroughs. China’s restrictions on advanced GPUs forced DeepSeek to focus on software optimization, resulting in a more efficient AI model. This aligns with a broader principle in technology innovation: initial phases of hype and overspending are often followed by periods of refinement and efficiency, especially when resources are scarce.&lt;/p&gt;

&lt;p&gt;DeepSeek’s success demonstrates that constraints can foster creativity and lead to solutions that challenge the status quo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Jevons Paradox and its potential implications for the AI industry
&lt;/h3&gt;

&lt;p&gt;The concept of Jevons Paradox — where increased efficiency leads to greater resource consumption — may hold significant implications for the AI industry. Microsoft CEO Satya Nadella has suggested that this paradox applies to AI, arguing that as models like DeepSeek’s R1 become more efficient and accessible, their use will skyrocket, turning AI into a commodity. This could lead to a surge in AI applications across industries, further accelerating innovation and adoption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Market Reaction
&lt;/h3&gt;

&lt;p&gt;The emergence of DeepSeek has sent ripples through the tech industry, with major tech and chip companies experiencing a downturn. This shift reflects a growing realization that AI innovation may no longer be as heavily reliant on hardware as previously thought. While this poses challenges for companies heavily invested in hardware, it is ultimately a positive development for the long-term growth of AI, making the technology more accessible and affordable.&lt;/p&gt;

&lt;p&gt;Some investors are calling DeepSeek’s rise a “Sputnik moment” for AI, signaling that the U.S. no longer holds a monopoly on AI innovation. DeepSeek has proven that world-class AI models can be developed outside the U.S., even under significant constraints. This shift is likely to encourage greater global participation in AI development, with the open-source community playing a pivotal role in further optimizing and democratizing AI technology.&lt;/p&gt;

&lt;p&gt;As the market digests DeepSeek’s impact, all eyes are on upcoming quarterly updates and management calls from tech giants like ASML, Meta, Microsoft, Tesla, and Apple. Analysts will be keen to understand how these companies plan to balance efficiency with their past capital investments in light of DeepSeek’s success. The pressure is on for these firms to demonstrate adaptability and forward-thinking strategies in an increasingly competitive AI landscape.&lt;/p&gt;

&lt;h3&gt;
  
  
  DeepSeek R1 Model
&lt;/h3&gt;

&lt;p&gt;DeepSeek’s R1 model is a prime example of efficiency-driven AI development. Designed to require only a fraction of the compute resources of comparable models, R1 was reportedly developed on a budget of $5.6 million, far less than the estimated $100 million spent on training OpenAI’s GPT-4. The model emphasizes software optimization and leverages constraints, such as limited access to advanced GPUs, to achieve competitive performance.&lt;/p&gt;

&lt;p&gt;R1’s architecture is both innovative and practical. With 671 billion parameters, it activates only 37 billion during use, making it highly efficient. DeepSeek also offers smaller distilled models, ranging from 1.5B to 70B parameters, based on Qwen and Llama architectures. These models are designed to deliver powerful reasoning capabilities, self-verification, and the ability to generate long chains of thought (CoTs), demonstrating that the reasoning patterns of larger models can be distilled into smaller, more accessible versions.&lt;/p&gt;

&lt;p&gt;DeepSeek’s commitment to open-source development is another key factor in its success. The company has made R1-Zero, R1, and six dense models distilled from R1 available to the public, supporting commercial use, modifications, and derivative works. This open approach not only fosters collaboration but also accelerates the pace of innovation within the AI community.&lt;/p&gt;

&lt;p&gt;DeepSeek’s models have been rigorously evaluated across a range of tasks, including math, coding, and language understanding. Metrics such as MMLU (Pass@1), DROP (F1), LiveCodeBench (Pass@1-COT), and Codeforces (Rating) highlight the model’s strong performance, particularly in reasoning and problem-solving tasks. These results position DeepSeek as a formidable competitor to established models like OpenAI’s o1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11lekz0d2xyrw1wpqznf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11lekz0d2xyrw1wpqznf.png" alt="Benchmark" width="700" height="767"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DeepSeek’s rise marks a significant shift in the AI landscape, challenging traditional notions of resource dependency and innovation. By turning constraints into opportunities, DeepSeek has not only developed a highly efficient AI model but also sparked a broader conversation about the future of AI development. As the industry continues to evolve, DeepSeek’s approach — rooted in efficiency, open collaboration, and adaptability — may well set the standard for the next generation of AI innovation.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Git: recreating the history</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Sat, 14 Oct 2023 20:49:33 +0000</pubDate>
      <link>https://forem.com/pinei/git-recreating-history-7d8</link>
      <guid>https://forem.com/pinei/git-recreating-history-7d8</guid>
      <description>&lt;p&gt;Why you might want to reset the history of your Git repository and redefine the files that will be versioned?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Reduce Repository Size: Over time, a repository can accumulate a large history with numerous commits, branches, and tagged releases. This can make cloning the repository very slow, especially for new team members. Resetting the history can help shrink the repository's size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sensitive Data Removal: If sensitive data like passwords or API keys have been committed to the repository, it's crucial to remove that information. While you can remove this data in a new commit, the sensitive data will still exist in the repository's history. Rewriting the history is the only reliable way to remove this data completely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplify History: A complex and messy commit history can make it difficult for team members to understand the evolution of a project. Resetting the history can simplify it by removing redundant or meaningless commits, making it easier to understand and maintain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change of Project Scope: Sometimes projects pivot or change direction. In such cases, old files, dependencies, and even codebases may no longer be relevant. Resetting the history allows you to start anew while keeping the most relevant parts intact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Licensing or Ownership Changes: In case the project undergoes changes in licensing or ownership, you may need to remove specific parts of the history to comply with legal requirements. This can involve removing commits from contributors who have not agreed to new terms, or removing code that cannot be licensed under the new terms.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a step-by-step guide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1. Create a new "orphan" branch that does not inherit history.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout &lt;span class="nt"&gt;--orphan&lt;/span&gt; new_branch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;2. Remove all files from the Git index (staging area).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This removes files only from the index, not from the file system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="nt"&gt;--cached&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;3. Add the files you wish to version. Take the opportunity to set up &lt;code&gt;.gitignore&lt;/code&gt;. Commit these changes.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Initial commit with selected files"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;4. Delete the old main branch.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git branch &lt;span class="nt"&gt;-D&lt;/span&gt; main  &lt;span class="c"&gt;# replace "main" with the name of your main branch, if different&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;5. Rename the new branch to be the new main branch.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git branch &lt;span class="nt"&gt;-m&lt;/span&gt; main  &lt;span class="c"&gt;# again, replace "main" if necessary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;6. Perform a "force push" to the remote repository.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git push &lt;span class="nt"&gt;-f&lt;/span&gt; origin main  &lt;span class="c"&gt;# replace "main" and "origin" as needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These actions are irreversible on the remote repository and can disrupt other collaborators working on the same repository: they will need to synchronize their local copies with the new history.&lt;/p&gt;

&lt;p&gt;By following these steps, you will create a new main branch with a new set of versioned files, discarding the previous history.&lt;/p&gt;
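Putting the whole sequence together, here is a self-contained sketch you can run in a throwaway temp directory (repository names, file contents, and the demo identity are all illustrative): it creates a bare "remote", rewrites history as in the steps above, and shows the commands a collaborator runs to resynchronize.

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git -c init.defaultBranch=main init -q --bare remote.git

# Author: two commits of "old" history
git clone -q remote.git author
cd author
git checkout -qb main
echo one > file.txt && git add . && git commit -qm "first"
echo two >> file.txt && git add . && git commit -qm "second"
git push -q origin main

# A collaborator clones before the rewrite
cd .. && git clone -q remote.git collab

# Author recreates the history (the steps above) and force-pushes
cd author
git checkout -q --orphan new_branch
git rm -r -q --cached .
git add . && git commit -qm "Initial commit with selected files"
git branch -D main
git branch -m main
git push -q -f origin main

# Collaborator resynchronizes with the rewritten history
cd ../collab
git fetch -q origin
git checkout -q main
git reset -q --hard origin/main
git rev-list --count HEAD   # prints 1: only the new initial commit remains
```

Note the collaborator's `git reset --hard origin/main` discards any local commits built on the old history; they should be rebased or cherry-picked onto the new branch first if needed.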

</description>
      <category>git</category>
    </item>
    <item>
      <title>Running PySpark in JupyterLab on a Raspberry Pi</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Sun, 01 Oct 2023 23:44:42 +0000</pubDate>
      <link>https://forem.com/pinei/running-pyspark-in-jupyterlab-on-a-raspberry-pi-1293</link>
      <guid>https://forem.com/pinei/running-pyspark-in-jupyterlab-on-a-raspberry-pi-1293</guid>
      <description>&lt;p&gt;While researching materials for installing a JupyterLab instance with Spark support (via PySpark), I noticed a lot of outdated content. That was until I came across an up-to-date Docker image provided by the Jupyter Docker Stacks project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#apache-spark" rel="noopener noreferrer"&gt;https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#apache-spark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python and Java have good support for ARM architecture, so we can assume that any framework developed on these platforms will run well on a Raspberry Pi. Then, I just ran the Docker command to start the environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It couldn't be easier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdonwparpl0yheezmgpx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdonwparpl0yheezmgpx.png" alt="Jupyter Lab"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Checking CPU and memory&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx56vq3butijhll75935.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx56vq3butijhll75935.png" alt="!lscpu and !free"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's try some simple code to check the availability of PySpark and its SQL API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python Spark SQL basic example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.executor.memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2g&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.executor.cores&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.eventLog.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.shuffle.partitions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.serializer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.serializer.KryoSerializer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We got an in-memory SparkSession:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version: v3.5.0&lt;/li&gt;
&lt;li&gt;Master: local[*]&lt;/li&gt;
&lt;li&gt;AppName: Python Spark SQL basic example&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can set up a DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FloatType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BooleanType&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;

&lt;span class="c1"&gt;# Setup the Schema
&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Browser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Add Data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="mi"&gt;1580&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Barry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FireFox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Windows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5820&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MS Edge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Linux&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2340&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Harry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vivaldi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Windows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7860&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Albert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chrome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Windows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;May&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Safari&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;macOS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Setup the Data Frame
&lt;/span&gt;&lt;span class="n"&gt;user_data_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;user_data_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+--------+-------+-------+
|User ID|Username|Browser|     OS|
+-------+--------+-------+-------+
|   1580|   Barry|FireFox|Windows|
|   5820|     Sam|MS Edge|  Linux|
|   2340|   Harry|Vivaldi|Windows|
|   7860|  Albert| Chrome|Windows|
|   1123|     May| Safari|  macOS|
+-------+--------+-------+-------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can save this DataFrame as a physical table in a new database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE DATABASE raspland&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;user_data_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raspland.user_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then run SQL commands over the table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM raspland.user_data WHERE OS = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Linux&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+--------+-------+-----+
|User ID|Username|Browser|   OS|
+-------+--------+-------+-----+
|   5820|     Sam|MS Edge|Linux|
+-------+--------+-------+-----+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By accessing the Terminal, we can check the Parquet files that store the data of our created table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(base) jovyan@0e1d1463f0b0:~/spark-warehouse/raspland.db/user_data$ ls

part-00000-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
part-00001-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
part-00002-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
part-00003-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
_SUCCESS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the &lt;code&gt;jupyterlab-sql-editor&lt;/code&gt; extension to get enhanced functionalities for SQL execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frissjmc0cwzpd70a1wpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frissjmc0cwzpd70a1wpc.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a post to learn more about the extension:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/jupyterlab-sql-cell-editor-e6ac865b42df" rel="noopener noreferrer"&gt;https://towardsdatascience.com/jupyterlab-sql-cell-editor-e6ac865b42df&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will need to run two commands in the Terminal to install the server prerequisites.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install jupyterlab-lsp jupyterlab-sql-editor
sudo npm install -g sql-language-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load the extension in the notebook ...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%load_ext jupyterlab_sql_editor.ipython_magic.sparksql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and SparkSQL cells will be enabled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsecgxvrq9lntmy2ldg9s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsecgxvrq9lntmy2ldg9s.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Metastore
&lt;/h3&gt;

&lt;p&gt;Spark can use a Hive metastore to manage databases and tables; we will talk about Hive later. So far, our data has been physically persisted in Parquet files, while our metadata has remained only in memory.&lt;/p&gt;

&lt;p&gt;We need to enable Hive support to persist our metadata. So let's delete our data, recreate our Spark session, and run the sample again.&lt;/p&gt;

&lt;p&gt;In the Terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rm -rf spark-warehouse/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change the code in the notebook to enable Hive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python Spark SQL basic example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enableHiveSupport&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
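&lt;p&gt;As a complete configuration sketch (the warehouse option shown is the standard &lt;code&gt;spark.sql.warehouse.dir&lt;/code&gt; setting, added here for illustration; it does not reproduce the options elided above), a Hive-enabled session can be built like this:&lt;/p&gt;

```python
from pyspark.sql import SparkSession

# Builder sketch: enableHiveSupport() is the key call; the other
# options are illustrative defaults, not the article's elided ones.
spark = (
    SparkSession.builder
    .appName("Python Spark SQL basic example")
    # Where managed tables store their Parquet files:
    .config("spark.sql.warehouse.dir", "spark-warehouse")
    # Persist metadata in a Hive metastore (embedded Derby by default):
    .enableHiveSupport()
    .getOrCreate()
)
```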



&lt;p&gt;After running our database creation script, certain files and folders appear in the file explorer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ls0hhefh9au1c5tsjhp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ls0hhefh9au1c5tsjhp.png" alt="metastore_db folder"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While &lt;code&gt;spark-warehouse&lt;/code&gt; houses our Parquet files, &lt;code&gt;metastore_db&lt;/code&gt; serves as a Derby repository to store our database and table definitions.&lt;/p&gt;

&lt;p&gt;Derby is a lightweight relational database management system (RDBMS) implemented in Java. It is often used as a local, embedded metastore for Spark's SQL component when running in standalone mode.&lt;/p&gt;

&lt;p&gt;Hive is a data warehousing and SQL query engine for Hadoop, originally developed by Facebook. It allows you to query big data in a distributed storage architecture like Hadoop's HDFS using SQL-like syntax. Hive's architecture includes a metadata repository stored in an RDBMS, which is often referred to as the Hive Metastore.&lt;/p&gt;

&lt;p&gt;The standalone installation of Spark does not inherently include Hive, but it has built-in support for connecting to a Hive metastore if you have one set up separately.&lt;/p&gt;

&lt;p&gt;Thank you for reading. Let me know in the comments what else you might be interested in to complement this article.&lt;/p&gt;

</description>
      <category>jupyter</category>
      <category>pyspark</category>
      <category>docker</category>
      <category>raspberrypi</category>
    </item>
    <item>
      <title>A comparison of SageMaker and Databricks for machine learning</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Fri, 21 Jul 2023 18:08:05 +0000</pubDate>
      <link>https://forem.com/pinei/a-comparison-of-sagemaker-and-databricks-for-machine-learning-2pfj</link>
      <guid>https://forem.com/pinei/a-comparison-of-sagemaker-and-databricks-for-machine-learning-2pfj</guid>
      <description>&lt;h3&gt;
  
  
  SageMaker
&lt;/h3&gt;

&lt;p&gt;Amazon SageMaker is a fully managed service that provides an end-to-end machine learning (ML) platform. It includes a variety of features that help you build, train, deploy, and monitor ML models.&lt;/p&gt;

&lt;p&gt;Some of the key features of Amazon SageMaker include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A wide range of pre-trained models: Amazon SageMaker provides a wide range of pre-trained models that you can use to get started with ML quickly. These models are trained on a variety of datasets, so you can find a model that is relevant to your application.&lt;/li&gt;
&lt;li&gt;A variety of algorithms: Amazon SageMaker provides a variety of algorithms that you can use to train your own ML models. These algorithms include linear regression, logistic regression, decision trees, random forests, and support vector machines.&lt;/li&gt;
&lt;li&gt;A variety of deployment options: Amazon SageMaker provides a variety of deployment options for your ML models. You can deploy your models to Amazon SageMaker hosting services, Amazon Elastic Container Service (ECS), or Amazon Elastic Beanstalk.&lt;/li&gt;
&lt;li&gt;Monitoring tools: Amazon SageMaker provides monitoring tools that you can use to track the performance of your ML models, such as Amazon CloudWatch and Amazon SageMaker Model Monitor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8gvOSTbi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://th.bing.com/th/id/R.7ec475c03ac0c84707b1d2d0a733b5fe%3Frik%3Dk45xrVdNsVh%252f6Q%26pid%3DImgRaw%26r%3D0" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8gvOSTbi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://th.bing.com/th/id/R.7ec475c03ac0c84707b1d2d0a733b5fe%3Frik%3Dk45xrVdNsVh%252f6Q%26pid%3DImgRaw%26r%3D0" alt="SageMaker Quick Start Solutions" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html"&gt;SageMaker JumpStart&lt;/a&gt; provides a Python SDK with pretrained, open-source models for a wide range of problem types.&lt;/p&gt;

&lt;p&gt;Here are some of the benefits of using Amazon SageMaker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced development time: Amazon SageMaker can help you reduce the development time for your ML models. You can focus on building your models, not on provisioning and managing infrastructure.&lt;/li&gt;
&lt;li&gt;Improved accuracy: Amazon SageMaker can help you improve the accuracy of your ML models. It provides a variety of pre-trained models that you can use as a starting point. You can also use Amazon SageMaker's algorithms to train your own models.&lt;/li&gt;
&lt;li&gt;Increased scalability: Amazon SageMaker can automatically scale your models up or down based on demand. This can help you save money on infrastructure costs.&lt;/li&gt;
&lt;li&gt;Improved security: The service provides a variety of features that can help you protect your models from unauthorized access.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Databricks
&lt;/h3&gt;

&lt;p&gt;Databricks is a cloud-based platform that offers a unified environment for data engineering, data science, and machine learning, self-identifying as a Lakehouse platform. It is built on top of Apache Spark, a popular open-source distributed computing framework. The service has grown to support the three major public clouds (AWS, Azure, GCP) in several regions around the world.&lt;/p&gt;

&lt;p&gt;In the context of ML, Databricks can be used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build and train ML models: Databricks provides a variety of tools and libraries that can be used to build and train ML models. These tools include the MLflow tracking library, which can be used to track the performance of ML models, and the AutoML functionality, which can be used to automatically train and tune ML models.&lt;/li&gt;
&lt;li&gt;Deploy ML models: Once an ML model has been trained, it can be deployed to production using the platform. Databricks provides a variety of deployment options, including on-premises deployments and cloud-based deployments.&lt;/li&gt;
&lt;li&gt;Monitor ML models: Once an ML model has been deployed, it can be monitored using Databricks. Databricks provides a variety of monitoring tools that can be used to track the performance of ML models, such as the MLflow tracking library and the Databricks Monitoring dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are some of the benefits of using Databricks for ML:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ease of use: Databricks has a user-friendly platform that is easy to learn and use. This makes it a good choice for organizations that are new to ML.&lt;/li&gt;
&lt;li&gt;Scalability: Databricks is a scalable platform that can be used to handle large datasets and complex ML models. This makes it a good choice for organizations that need to scale their ML workloads.&lt;/li&gt;
&lt;li&gt;Integration with other tools: The platform integrates with a variety of tools for data sources, BI, development and ETL. This makes it easy to use Databricks with other tools that you are already using.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Differences in the context of ML
&lt;/h3&gt;

&lt;p&gt;Amazon SageMaker and Databricks are both popular cloud-based ML platforms, but they have different strengths and weaknesses.&lt;/p&gt;

&lt;p&gt;Amazon SageMaker is a fully managed platform that provides an end-to-end ML solution. It includes a wide range of features for building, training, deploying, and monitoring ML models. SageMaker is a good choice for organizations that want a turnkey solution that they don't have to manage themselves.&lt;/p&gt;

&lt;p&gt;Databricks is a more open platform that gives users more control over their ML infrastructure. It includes a wide range of features for data engineering, data science, and ML. It is a good choice for organizations that want a more flexible platform that they can customize to their specific needs.&lt;/p&gt;

&lt;p&gt;Here is a table that summarizes the main differences between SageMaker and Databricks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Amazon SageMaker&lt;/th&gt;
&lt;th&gt;Databricks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Managed vs. self-managed&lt;/td&gt;
&lt;td&gt;Fully managed&lt;/td&gt;
&lt;td&gt;Managed service with more user control over infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Features&lt;/td&gt;
&lt;td&gt;Wide range of features for building, training, deploying, and monitoring ML models&lt;/td&gt;
&lt;td&gt;Wide range of features for data engineering, data science, and ML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;More expensive&lt;/td&gt;
&lt;td&gt;Less expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The best platform for you will depend on your specific needs and requirements. If you are looking for a turnkey solution that you don't have to manage yourself, then Amazon SageMaker is a good choice. If you want a more flexible platform that you can customize to your specific needs, then Databricks is a good choice.&lt;/p&gt;

&lt;p&gt;Some additional considerations that may help you decide which platform is right for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your team's experience with ML: If your team is new to ML, then Amazon SageMaker may be a good choice because it provides a more guided experience. If your team has more experience with ML, then Databricks may be a good choice because it gives you more flexibility.&lt;/li&gt;
&lt;li&gt;The size of your dataset: If you have a large dataset, then Amazon SageMaker may be a better choice because it can scale quickly to handle larger datasets. Databricks is more cost-effective for smaller datasets.&lt;/li&gt;
&lt;li&gt;Your specific requirements: If you have specific requirements, such as the need for a particular algorithm or the need to integrate with a specific third-party tool, then you will need to compare the features of Amazon SageMaker and Databricks to see which platform meets your needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Storage and data access
&lt;/h3&gt;

&lt;p&gt;Both Databricks and SageMaker offer a variety of options for storing and accessing datasets for ML activity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks&lt;/strong&gt; offers these main options for storing datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Simple Storage Service (S3):&lt;/strong&gt; S3 is a highly scalable and durable object storage service. Databricks makes it easy to store datasets in S3 and to access them from Databricks notebooks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks File System (DBFS):&lt;/strong&gt; DBFS is a distributed file system abstraction built on top of cloud object storage such as S3. It lets notebooks and jobs address objects through familiar file-system paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Data Lake Storage:&lt;/strong&gt; Databricks also supports Azure Data Lake Storage, which is a cloud-based file storage service from Microsoft.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Main &lt;strong&gt;SageMaker&lt;/strong&gt; options for storing datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Simple Storage Service (S3):&lt;/strong&gt; S3 is also a popular option for storing datasets in SageMaker. SageMaker makes it easy to store datasets in S3 and to access them from SageMaker notebooks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Relational Database Service (RDS):&lt;/strong&gt; RDS is a fully managed relational database service that can be used to store datasets for ML activity. SageMaker makes it easy to create and connect to RDS databases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift:&lt;/strong&gt; Redshift is a data warehouse service that can be used to store large datasets for ML activity. SageMaker makes it easy to create and connect to Redshift clusters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In terms of access, both Databricks and SageMaker offer a variety of ways to access datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks&lt;/strong&gt; offers the following ways to access datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Notebooks:&lt;/strong&gt; Databricks notebooks allow you to access datasets from within a "Jupyter-like" notebook environment. This is a convenient way to explore and analyze datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL:&lt;/strong&gt; Databricks also supports SQL, so you can access datasets using SQL queries. This can be useful for working with large datasets or for integrating with other applications that use SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python APIs:&lt;/strong&gt; Databricks also offers Python APIs that you can use to access datasets. This can be useful for automating tasks or for integrating with other applications that use Python.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SageMaker&lt;/strong&gt; offers the following ways to access datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jupyter notebooks:&lt;/strong&gt; SageMaker also allows you to access datasets from within a Jupyter notebook environment. This is a convenient way to explore and analyze datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python APIs:&lt;/strong&gt; SageMaker also offers Python APIs that you can use to access datasets. This can be useful for automating tasks or for integrating with other applications that use Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SageMaker Studio:&lt;/strong&gt; SageMaker Studio is a web-based IDE that allows you to access datasets and to run ML experiments. This can be a convenient way to work with datasets if you are not familiar with Jupyter notebooks or Python.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, the best way to store and access datasets for ML activity will depend on your specific needs and requirements.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>sagemaker</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Free Oracle Database (How to)</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Wed, 02 Sep 2020 05:31:30 +0000</pubDate>
      <link>https://forem.com/pinei/free-oracle-database-how-to-10hb</link>
      <guid>https://forem.com/pinei/free-oracle-database-how-to-10hb</guid>
      <description>&lt;p&gt;This article is a simple guide to getting started with 2 database models in the Oracle Cloud using the &lt;code&gt;Always Free Tier&lt;/code&gt; program.&lt;/p&gt;

&lt;p&gt;I don't work for Oracle, but I have worked with Oracle tools &lt;code&gt;on premises&lt;/code&gt; for years. So it's time to check what Oracle Cloud has to offer, at least in terms of databases. And no more &lt;code&gt;Oracle XE&lt;/code&gt; on the local machine 🎆 .&lt;/p&gt;

&lt;p&gt;You will need a credit card to open an account with Oracle Cloud, but the samples we are going to create are cost free.&lt;/p&gt;

&lt;p&gt;I have already created an account, so I will not show that step here. We can jump straight to the &lt;a href="https://www.oracle.com/cloud/sign-in.html"&gt;console&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oY4ja3zO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/vwsdnd00z5zf0qxhb6l9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oY4ja3zO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/vwsdnd00z5zf0qxhb6l9.png" alt="Alt Text" width="659" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the console below, at the red mark, a 200 USD trial is offered to new users for 30 days. Keep track of when this period ends so you are not surprised by a credit card charge later if you leave some resources active.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jg9yBLzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/z5tawhtn4z473qdznx29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jg9yBLzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/z5tawhtn4z473qdznx29.png" alt="Alt Text" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any of the buttons at the green mark sends us to the &lt;code&gt;Create Autonomous Database&lt;/code&gt; screen.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Oracle Cloud Infrastructure's Autonomous Database is a fully managed, preconfigured database environment with two workload types available, Autonomous Transaction Processing and Autonomous Data Warehouse.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After setting the basic name and display name of the database, we can choose the workload type.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KrZZ7OrO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/02pmzwxwuvr2xvyurzsc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KrZZ7OrO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/02pmzwxwuvr2xvyurzsc.png" alt="Alt Text" width="609" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Transactional Database and Data Warehouse are both systems that store data. But they serve very different purposes. (&lt;a href="https://panoply.io/data-warehouse-guide/the-difference-between-a-database-and-a-data-warehouse/"&gt;The Difference Between a Data Warehouse and a Database&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;JSON storage isn't available for the Always Free suite at this time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Autonomous Database can be used without charge as part of Oracle Cloud Infrastructure's suite of Always Free resources. Users have access to two Always Free instances of Autonomous Database. Always Free Autonomous Databases have a fixed 8 GB of memory, 20 GB of storage, 1 OCPU, and can be configured for either Autonomous Transaction Processing or Autonomous Data Warehouse workloads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So we can create a database for &lt;code&gt;Transaction Processing&lt;/code&gt; and another for &lt;code&gt;Data Warehouse&lt;/code&gt; as named below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Display Name&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Oracle Data Warehouse 20 GB 19c Shared Infrastructure&lt;/td&gt;
&lt;td&gt;DBWH20GB19cSHR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oracle Transaction Processing 20 GB 19c Shared Infrastructure&lt;/td&gt;
&lt;td&gt;DBTP20GB19cSHR&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each database will have an ADMIN (default) user. Let's assume that we will use the database name as the password. 🤪&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sz09WjTp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/cskskytlktfgw73883dp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sz09WjTp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/cskskytlktfgw73883dp.png" alt="Alt Text" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our databases are ready. Keep in mind that they will be automatically stopped if not used for 7 days.&lt;/p&gt;

&lt;h1&gt;
  
  
  Connecting to the databases
&lt;/h1&gt;

&lt;p&gt;Oracle offers some tools to visualize and administer our databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--09FLEWar--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/jjtga3mevy6o1f9wpbrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--09FLEWar--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/jjtga3mevy6o1f9wpbrg.png" alt="Alt Text" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I haven't tested any of these tools yet. In general, I prefer a local database tool, and I think &lt;a href="https://dbeaver.io/"&gt;DBeaver&lt;/a&gt; is the best free tool available for managing relational databases. All we need is a JDBC driver and a way to connect to the database, wherever it lives.&lt;/p&gt;

&lt;p&gt;Oracle Autonomous Database mandates a secure connection. For this connection type, Java applications that use a &lt;code&gt;JDBC Thin driver&lt;/code&gt; require either an &lt;code&gt;Oracle Wallet&lt;/code&gt; or a &lt;code&gt;Java KeyStore (JKS)&lt;/code&gt;. (&lt;a href="https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/connect-jdbc-thin-wallet.html#GUID-5ED3C08C-1A84-4E5A-B07A-A5114951AA9E"&gt;JDBC Thin Connections and Wallets&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;If you aren't familiar with those methods, it's not a problem. We will use the &lt;code&gt;Oracle Wallet&lt;/code&gt; method in this article, and you will see that it's simple.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Oracle Wallet provides a simple and easy method to manage database credentials across multiple domains. It allows you to update database credentials by updating the Wallet instead of having to change individual datasource definitions. This is accomplished by using a database connection string in the datasource definition that is resolved by an entry in the wallet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The wallet and keystore files are included in the client credentials .zip file that is available by clicking &lt;code&gt;DB Connection&lt;/code&gt; on the Oracle Cloud Infrastructure console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TiKnOxnF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/o08rar8fumet8tmoemu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TiKnOxnF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/o08rar8fumet8tmoemu0.png" alt="Alt Text" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select &lt;code&gt;Regional Wallet&lt;/code&gt; to get the credentials for both databases in the same .zip.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T0YyUVHR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/gr368wavta592st6cs9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T0YyUVHR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/gr368wavta592st6cs9g.png" alt="Alt Text" width="603" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need a password for the wallet. I will use &lt;code&gt;myWallet1&lt;/code&gt; for this sample. This will be the &lt;code&gt;keystore&lt;/code&gt; and &lt;code&gt;truststore&lt;/code&gt; password.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E9DrBiKY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/vqca7hhz9ihj2cju5oie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E9DrBiKY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/vqca7hhz9ihj2cju5oie.png" alt="Alt Text" width="602" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The .zip includes some Oracle configuration files and some binary files for SSL certificates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--APNNJ7VZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/zrrnk1pzmcqs6etm9cuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--APNNJ7VZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/zrrnk1pzmcqs6etm9cuw.png" alt="Alt Text" width="300" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Oracle Wallet&lt;/td&gt;
&lt;td&gt;ewallet.sso, ewallet.p12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Java KeyStore (JKS)&lt;/td&gt;
&lt;td&gt;truststore.jks, keystore.jks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Despite the .zip filename, &lt;code&gt;tnsnames.ora&lt;/code&gt; includes connection strings for both databases, since we selected a Regional Wallet for download. Each database has three to five service names, such as &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, and &lt;code&gt;low&lt;/code&gt;. The predefined service names provide different levels of performance and concurrency for Autonomous Databases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dbtp20gb19cshr_high = (description= ... )
dbtp20gb19cshr_low = (description= ... )
dbtp20gb19cshr_medium = (description= ... )
dbtp20gb19cshr_tp = (description= ... )
dbtp20gb19cshr_tpurgent = (description= ... )

dbwh20gb19cshr_high = (description= ... )
dbwh20gb19cshr_low = (description= ... )
dbwh20gb19cshr_medium = (description= ...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
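&lt;p&gt;As a small illustration (a rough sketch, not a full &lt;code&gt;tnsnames.ora&lt;/code&gt; parser; the helper name and sample text are ours), the declared service-name aliases can be listed with a few lines of Python:&lt;/p&gt;

```python
import re

def list_service_names(tnsnames_text):
    """Return the service-name aliases declared in a tnsnames.ora-style
    file: the identifier before each top-level '= (description=' entry."""
    return re.findall(r"^\s*(\w+)\s*=\s*\(description",
                      tnsnames_text, flags=re.IGNORECASE | re.MULTILINE)

sample = """
dbtp20gb19cshr_high = (description= ... )
dbtp20gb19cshr_low = (description= ... )
dbwh20gb19cshr_medium = (description= ... )
"""
print(list_service_names(sample))
# ['dbtp20gb19cshr_high', 'dbtp20gb19cshr_low', 'dbwh20gb19cshr_medium']
```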



&lt;p&gt;We can use any service name to test our connection, but each one serves a different purpose in a real application, as you can see in the documentation (links below).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Documentation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/connect-predefined.html#GUID-9747539B-FD46-44F1-8FF8-F5AC650F15BE"&gt;Predefined Database Service Names for Autonomous Data Warehouse&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://docs.oracle.com/en/cloud/paas/atp-cloud/atpug/connect-predefined.html#GUID-9747539B-FD46-44F1-8FF8-F5AC650F15BE"&gt;Predefined Database Service Names for Autonomous Transaction Processing&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It's time to configure DBeaver. The default Oracle JDBC driver provided by the tool doesn't support Oracle Wallet, so I had to download the &lt;a href="https://www.oracle.com/database/technologies/appdev/jdbc-ucp-19-6-c-downloads.html"&gt;Oracle Database 19c (19.6) JDBC Driver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I0IoJjhE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/xn3efxhsvsex9c245jon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I0IoJjhE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/xn3efxhsvsex9c245jon.png" alt="Alt Text" width="308" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go to &lt;code&gt;Tools&lt;/code&gt; → &lt;code&gt;Driver Manager&lt;/code&gt; in DBeaver to create a copy of an Oracle driver configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ipHkgcI1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/cvqapja9gnc5xxki9o32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ipHkgcI1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/cvqapja9gnc5xxki9o32.png" alt="Alt Text" width="436" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this new configuration, replace the default libraries with the JAR files included in the 19.6 JDBC driver .zip.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_ggFpzMh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/nwvq4n8pczruqt82623r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_ggFpzMh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/nwvq4n8pczruqt82623r.png" alt="Alt Text" width="706" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Name it &lt;code&gt;Oracle 19.6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mLzkTzDX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/wlphujmcga5gs6smvene.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mLzkTzDX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/wlphujmcga5gs6smvene.png" alt="Alt Text" width="501" height="778"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;Database Navigator&lt;/code&gt;, &lt;code&gt;Create&lt;/code&gt; a &lt;code&gt;New Connection&lt;/code&gt;. Select our new driver.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q5zg6B9o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/x29x2tuogvljqi62u180.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q5zg6B9o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/x29x2tuogvljqi62u180.png" alt="Alt Text" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will provide a custom JDBC URL for our &lt;code&gt;Transaction Processing&lt;/code&gt; database connection. For the URL and the other fields we need a few pieces of information:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Info&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Service name&lt;/td&gt;
&lt;td&gt;dbtp20gb19cshr_tp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wallet location path&lt;/td&gt;
&lt;td&gt;/Users/.../Downloads/Wallet_DBTP20GB19cSHR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Username&lt;/td&gt;
&lt;td&gt;ADMIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Password&lt;/td&gt;
&lt;td&gt;DBTP20GB19cSHR (we set our database name 🤪 )&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the shape of the URL for Oracle Wallet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jdbc:oracle:thin:@dbtp20gb19cshr_tp?TNS_ADMIN=/Users/pinei/Downloads/Wallet_DBTP20GB19cSHR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After filling in the fields, test the connection. The first connection may take a little longer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V7mMnoQT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/0d094ej2oy79mr5qd49q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V7mMnoQT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/0d094ej2oy79mr5qd49q.png" alt="Alt Text" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can click &lt;code&gt;Finish&lt;/code&gt; to create the connection configuration.&lt;/p&gt;

&lt;p&gt;We can repeat the process for the Data Warehouse Database using these values:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Info&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Service name&lt;/td&gt;
&lt;td&gt;dbwh20gb19cshr_medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wallet location path&lt;/td&gt;
&lt;td&gt;/Users/.../Downloads/Wallet_DBTP20GB19cSHR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Username&lt;/td&gt;
&lt;td&gt;ADMIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Password&lt;/td&gt;
&lt;td&gt;DBWH20GB19cSHR (we set our database name 🤪 )&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And the URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jdbc:oracle:thin:@dbwh20gb19cshr_medium?TNS_ADMIN=/Users/pinei/Downloads/Wallet_DBTP20GB19cSHR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can rename the connections afterward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0ApXqSOQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/8evfj9fy5qx7yx2xf38t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0ApXqSOQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/8evfj9fy5qx7yx2xf38t.png" alt="Alt Text" width="434" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We didn't need the wallet password for our configuration, but the Java KeyStore (JKS) method makes use of it. If you want to try it, see &lt;a href="https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/connect-jdbc-thin-wallet.html#GUID-5ED3C08C-1A84-4E5A-B07A-A5114951AA9E"&gt;JDBC Thin Connections and Wallets&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This article doesn't end here, but I decided to publish what I have so far. More to come...&lt;/p&gt;

</description>
      <category>database</category>
      <category>java</category>
    </item>
    <item>
      <title>First steps with Haskell</title>
      <dc:creator>Pinei</dc:creator>
      <pubDate>Wed, 19 Aug 2020 05:02:12 +0000</pubDate>
      <link>https://forem.com/pinei/first-steps-with-haskell-49l6</link>
      <guid>https://forem.com/pinei/first-steps-with-haskell-49l6</guid>
      <description>&lt;p&gt;This is my first article at &lt;strong&gt;Dev.to&lt;/strong&gt;. I'm not used to writing in English, but I am thinking of get some practice while learning pure functional programming with &lt;strong&gt;Haskell&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Haskell is a programming language from the 1990s, characterized by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi OS (Windows, Mac, Linux)&lt;/li&gt;
&lt;li&gt;General purpose (like C#, Java, C, C++)&lt;/li&gt;
&lt;li&gt;Purely functional (like Elm)&lt;/li&gt;
&lt;li&gt;Strong static typing (like Java, C, C++)&lt;/li&gt;
&lt;li&gt;Type inference (like OCaml, Scala, Kotlin)&lt;/li&gt;
&lt;li&gt;Lazy evaluation&lt;/li&gt;
&lt;li&gt;Concise programs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Haskell is used commercially in a diverse range of industries, from aerospace and defense, to finance, to web startups, hardware design firms and lawnmower manufacturers (&lt;a href="https://wiki.haskell.org/Haskell_in_industry"&gt;Haskell Wiki&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The language is widely known as a functional programming language which is great for writing pure code and immutable data structures. It also has great support for a wide variety of concurrency models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional programming&lt;/strong&gt; is a style of programming in which the basic method of computation is the application of functions to arguments, encouraging us to build software by composing functions in a declarative way (rather than imperative), avoiding shared state and mutable data.&lt;/p&gt;
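&lt;p&gt;As a small sketch of this style (the function name below is just an illustration), a computation can be built by composing small pure functions rather than mutating state in a loop:&lt;/p&gt;

```haskell
-- Declarative pipeline: keep the even numbers, square them, sum the result.
-- Built by composing pure functions with (.); no mutable state involved.
sumOfEvenSquares :: [Int] -> Int
sumOfEvenSquares = sum . map (^ 2) . filter even

main :: IO ()
main = print (sumOfEvenSquares [1 .. 10])  -- 220
```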

&lt;p&gt;In a &lt;strong&gt;pure functional language&lt;/strong&gt;, you can't do anything that has a side effect. As a benefit, the language compiler can be a set of pure mathematical transformations. This results in much better high-level optimization facilities. &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;side effect&lt;/strong&gt; would mean that evaluating an expression changes some internal state that would later cause evaluating the same expression to have a different result. In a pure functional language we can evaluate the same expression as often as we want with the same arguments, and it would always return the same value, because there is no state to change. (&lt;a href="https://stackoverflow.com/questions/4382223/what-does-pure-mean-in-pure-functional-language"&gt;What does “pure” mean in “pure functional language”?&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;In Haskell, even functions based on system inputs and outputs (like &lt;code&gt;print&lt;/code&gt;, &lt;code&gt;getChar&lt;/code&gt; and &lt;code&gt;getSystemTime&lt;/code&gt;) are pure. They build &lt;code&gt;IO&lt;/code&gt; actions (where IO is a &lt;em&gt;monad&lt;/em&gt;, much as JavaScript &lt;code&gt;Promises&lt;/code&gt; are), which can be combined and performed by the runtime. We can say that &lt;code&gt;getSystemTime&lt;/code&gt; is a constant which represents the action of getting the current time. This action is the same every time, no matter when it is used. (&lt;a href="https://wiki.haskell.org/IO_inside"&gt;IO Inside&lt;/a&gt;)&lt;/p&gt;
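&lt;p&gt;A tiny sketch of this idea: an &lt;code&gt;IO&lt;/code&gt; action is an ordinary value that describes an effect, and nothing happens until the runtime performs it from &lt;code&gt;main&lt;/code&gt;:&lt;/p&gt;

```haskell
-- 'tick' is a value of type IO (); defining it performs nothing.
tick :: IO ()
tick = putStrLn "tick"

-- replicate copies the same action value three times;
-- sequence_ performs the actions in order.
main :: IO ()
main = sequence_ (replicate 3 tick)  -- prints "tick" three times
```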

&lt;p&gt;A language is &lt;strong&gt;statically typed&lt;/strong&gt; if the type of a variable is known at compile time. For some languages this means that you as the programmer must specify what type each variable is (e.g.: Java, C, C++); other languages offer some form of &lt;em&gt;type inference&lt;/em&gt;, the capability of the type system to deduce the type of a variable.&lt;/p&gt;

&lt;p&gt;The main advantage here is that all kinds of checking can be done by the compiler, and therefore a lot of trivial bugs and a very large class of program errors are caught at a very early stage. (&lt;a href="https://stackoverflow.com/questions/1517582/what-is-the-difference-between-statically-typed-and-dynamically-typed-languages/"&gt;What is the difference between statically typed and dynamically typed languages?&lt;/a&gt;)&lt;/p&gt;
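&lt;p&gt;A quick sketch of type inference at work: neither function below is annotated by hand; the commented signatures are what GHC deduces on its own:&lt;/p&gt;

```haskell
-- No type annotations needed; GHC infers:
--   half  :: Fractional a => a -> a
--   greet :: String -> String
half x = x / 2
greet name = "Hi, " ++ name

main :: IO ()
main = do
  print (half 9)          -- 4.5
  putStrLn (greet "GHC")  -- Hi, GHC
```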

&lt;p&gt;&lt;strong&gt;Lazy evaluation&lt;/strong&gt; means that expressions are not evaluated when they are bound to variables, but their evaluation is deferred until their results are needed. When it is evaluated, the result is saved so repeated evaluation is not needed.&lt;/p&gt;

&lt;p&gt;Lazy evaluation is a technique that can make some algorithms easier to express compactly and much more efficiently. It encourages programming in a modular style using intermediate data structures, and even allows data structures with an infinite number of elements. (&lt;a href="https://www.cs.nott.ac.uk/~pszgmh/pih.html"&gt;Programming in Haskell&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Imagine an infinite data structure like a &lt;strong&gt;Fibonacci list&lt;/strong&gt;. We don't actually figure out what the next element is until we ask for it. (&lt;a href="https://riptutorial.com/haskell/example/12240/fibonacci--using-lazy-evaluation"&gt;Fibonacci, Using Lazy Evaluation&lt;/a&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="kr"&gt;let&lt;/span&gt; &lt;span class="n"&gt;fibs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;zipWith&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;fibs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="n"&gt;fibs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;take&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;fibs&lt;/span&gt;        &lt;span class="c1"&gt;-- [0,1,1,2,3,5,8,13,21,34]&lt;/span&gt;
&lt;span class="n"&gt;fibs&lt;/span&gt;&lt;span class="o"&gt;!!&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;            &lt;span class="c1"&gt;-- 55&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have to understand some functions and operators to interpret the code above.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;:&lt;/code&gt; operator prepends an element to a list&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tail&lt;/code&gt; function returns a list without its first element&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;zipWith&lt;/code&gt; uses a specified function (in this case &lt;code&gt;addition&lt;/code&gt;) to combine corresponding elements of two lists (in this case &lt;code&gt;fibs&lt;/code&gt; and &lt;code&gt;tail fibs&lt;/code&gt;) to produce a third&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;take&lt;/code&gt; returns a sized prefix of a list&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;!!&lt;/code&gt; is the list index (subscript) operator, starting from 0&lt;/li&gt;
&lt;/ul&gt;
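&lt;p&gt;Each of these can be tried in isolation; a few one-liners (with their results as comments) make the definition above easier to follow:&lt;/p&gt;

```haskell
main :: IO ()
main = do
  print (1 : [2, 3])                   -- [1,2,3]  (:) prepends an element
  print (tail [1, 2, 3])               -- [2,3]
  print (zipWith (+) [1, 2] [10, 20])  -- [11,22]
  print (take 3 [0 ..])                -- [0,1,2]  works on an infinite list
  print ([10, 20, 30] !! 1)            -- 20       indexing starts at 0
```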

&lt;p&gt;Another nice example is a definition for an infinite list of integers (starting at 2) where each number is not divisible by any previous number in the list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;primes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filterPrime&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="kr"&gt;where&lt;/span&gt; &lt;span class="n"&gt;filterPrime&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
          &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;filterPrime&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;`&lt;/span&gt;&lt;span class="n"&gt;mod&lt;/span&gt;&lt;span class="p"&gt;`&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;take&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;primes&lt;/span&gt;      &lt;span class="c1"&gt;-- [2,3,5,7,11,13,17,19,23,29]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Due to the high-level nature of the functional style, programs written in Haskell are often much more &lt;strong&gt;concise&lt;/strong&gt; than in other languages. The language syntax was designed with concise programs in mind.&lt;/p&gt;
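&lt;p&gt;The classic quicksort (short, though not the most efficient implementation) is often used to illustrate this conciseness:&lt;/p&gt;

```haskell
import Data.List (partition)

-- Sort by picking a pivot and partitioning the rest around it.
qsort :: Ord a => [a] -> [a]
qsort []       = []
qsort (p : xs) = qsort smaller ++ [p] ++ qsort larger
  where (smaller, larger) = partition (p >) xs  -- elements below p, and the rest

main :: IO ()
main = print (qsort [3, 1, 4, 1, 5, 9, 2, 6])  -- [1,1,2,3,4,5,6,9]
```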

&lt;p&gt;These are the main concepts for now. To start testing and learning, let’s get our hands dirty.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Haskell platform
&lt;/h2&gt;

&lt;p&gt;A Haskell distribution may include tools such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Glasgow Haskell Compiler&lt;/li&gt;
&lt;li&gt;the Cabal build system&lt;/li&gt;
&lt;li&gt;the Stack tool&lt;/li&gt;
&lt;li&gt;core &amp;amp; widely-used packages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Glasgow Haskell Compiler (GHC) is an open source compiler and interactive environment (aka GHCi) for the language Haskell.&lt;/p&gt;

&lt;p&gt;Let's see some concepts from the &lt;em&gt;Cabal build system&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CABAL&lt;/code&gt; (the spec) is the Common Architecture for Building Applications &amp;amp; Libraries. It’s a specification for defining how Haskell applications and libraries should be built, defining dependencies, etc.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.cabal&lt;/code&gt; (the file format) is used to write down the definitions for a specific package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library
  exposed-modules:     HelloWorld
  -- other-modules:
  -- other-extensions:
  build-depends:       base &amp;gt;= 4.11 &amp;amp;&amp;amp; &amp;lt;4.12
  hs-source-dirs:      src
  default-language:    Haskell2010
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Cabal library&lt;/code&gt; implements the above specification and file format.&lt;/p&gt;

&lt;p&gt;The cabal executable (&lt;code&gt;cabal-install&lt;/code&gt;) is a command-line tool that uses Cabal (the library) to resolve dependencies and build packages.&lt;/p&gt;

&lt;p&gt;What about &lt;strong&gt;Stack&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;Stack is a replacement for &lt;code&gt;cabal-install&lt;/code&gt;, i.e. the cabal executable. It is a build tool that works on top of the Cabal build system, with a focus on reproducible build plans and multi-package projects.&lt;/p&gt;

&lt;p&gt;We can also count on online package repositories to complement the build system:&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://hackage.haskell.org/"&gt;Hackage package repository&lt;/a&gt;, providing more than ten thousand open source libraries and applications to help you get your work done&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.stackage.org/"&gt;Stackage package collection&lt;/a&gt;, a curated set of packages from Hackage which are regularly tested for compatibility. Stack defaults to using Stackage package sets to avoid dependency problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting with Stack
&lt;/h2&gt;

&lt;p&gt;On my computer I have the &lt;code&gt;brew&lt;/code&gt; tool installed (a package manager for macOS and Linux).&lt;/p&gt;

&lt;p&gt;With a simple command I was able to download and install the mentioned tools to start with Haskell.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ brew install haskell-stack
...
🍺  /usr/local/Cellar/haskell-stack/2.3.3: 6 files, 55.4MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;stack ghci&lt;/code&gt; command starts the Haskell interpreter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ stack ghci
...
Configuring GHCi with the following packages:
GHCi, version 8.8.3: https://www.haskell.org/ghc/  :? for help
Prelude&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Prelude&lt;/code&gt; is a module that contains a small set of standard definitions and is included automatically into all Haskell modules.&lt;/p&gt;

&lt;p&gt;Let's do a simple test and quit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prelude&amp;gt; let x = 12 in x / 4
3.0
Prelude&amp;gt; :quit
Leaving GHCi.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The first project
&lt;/h2&gt;

&lt;p&gt;With Stack and a template we can generate a starter project inside the current directory, but it's a good idea to configure some properties in advance. We can set these properties in the &lt;code&gt;~/.stack/config.yaml&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;templates:
  params:
    author-name: Aldinei
    author-email: aldinei@gmail.com
    copyright: none
    github-username: pinei
    category: Development
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use &lt;code&gt;stack new&lt;/code&gt; to create our first Haskell project called &lt;code&gt;haskell-kitchen&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can browse &lt;a href="https://github.com/commercialhaskell/stack-templates"&gt;stack-templates&lt;/a&gt; to search for other templates. Here the default "new-template" will be used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ stack new haskell-kitchen
Downloading template "new-template" to create project "haskell-kitchen" in haskell-kitchen/ ...
Looking for .cabal or package.yaml files to use to init the project.
Using cabal packages:
- haskell-kitchen/
...
Initialising configuration using resolver: lts-16.10
Total number of user packages considered: 1
Writing configuration to file: haskell-kitchen/stack.yaml
All done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;stack new&lt;/code&gt; command should have created the following files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── app
│   └── Main.hs
├── ChangeLog.md
├── LICENSE
├── haskell-kitchen.cabal
├── package.yaml
├── README.md
├── Setup.hs
├── src
│   └── Lib.hs
├── stack.yaml
└── test
    └── Spec.hs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see some text files and source folders (&lt;code&gt;app&lt;/code&gt;, &lt;code&gt;src&lt;/code&gt; and &lt;code&gt;test&lt;/code&gt;) containing &lt;code&gt;.hs&lt;/code&gt; files.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;package.yaml&lt;/code&gt; is the preferred package format provided by Stack. The default behaviour is to generate the &lt;code&gt;.cabal&lt;/code&gt; file from this &lt;code&gt;package.yaml&lt;/code&gt;, so you should not modify the &lt;code&gt;.cabal&lt;/code&gt; file. It is updated automatically as part of the stack build process.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Setup.hs&lt;/code&gt; file is another component of the Cabal build system. It's technically not needed by Stack, but it is still considered good practice in the Haskell world to include it.&lt;/p&gt;

&lt;p&gt;The project descriptor named &lt;code&gt;stack.yaml&lt;/code&gt; gives our project-level settings.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;src/Lib.hs&lt;/code&gt; source file, let's put some text in the function &lt;code&gt;someFunc&lt;/code&gt; of the &lt;code&gt;Lib&lt;/code&gt; module so we can see it printed to the output later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="kr"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;Lib&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nf"&gt;someFunc&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kr"&gt;where&lt;/span&gt;

&lt;span class="n"&gt;someFunc&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;IO&lt;/span&gt; &lt;span class="nb"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;someFunc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;putStrLn&lt;/span&gt; &lt;span class="s"&gt;"Welcome to Haskell Kitchen"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's see the &lt;code&gt;app/Main.hs&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="kr"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;Main&lt;/span&gt; &lt;span class="kr"&gt;where&lt;/span&gt;

&lt;span class="kr"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;Lib&lt;/span&gt;

&lt;span class="n"&gt;main&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;IO&lt;/span&gt; &lt;span class="nb"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;main&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;someFunc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function named &lt;code&gt;main&lt;/code&gt; in the module &lt;code&gt;Main&lt;/code&gt; is special. It is defined to be the entry point of a Haskell program (similar to the main function in C), and must have an &lt;code&gt;IO type&lt;/code&gt;, usually &lt;code&gt;IO ()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Our module &lt;code&gt;Lib&lt;/code&gt; is imported here and the function &lt;code&gt;someFunc&lt;/code&gt; will be invoked.&lt;/p&gt;

&lt;p&gt;Now let's see the &lt;code&gt;test/Spec.hs&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main :: IO ()
main = putStrLn "Test suite not yet implemented"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It would be interesting to write some tests to learn and document the code, but this stub gives us no clue about how to write test cases. That's a subject for another article.&lt;/p&gt;
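&lt;p&gt;Even without a framework, a minimal hand-rolled check in &lt;code&gt;Spec.hs&lt;/code&gt; could look like this (a sketch only; real projects would reach for a library such as Hspec or tasty):&lt;/p&gt;

```haskell
-- A tiny assertion helper: print PASS/FAIL with a label.
check :: String -> Bool -> IO ()
check label ok = putStrLn ((if ok then "PASS: " else "FAIL: ") ++ label)

main :: IO ()
main = do
  check "double 3 == 6"      (3 + 3 == 6)
  check "factorial 5 == 120" (product [1 .. 5] == 120)
```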

&lt;p&gt;Let's use Stack to run our "hello world" application. Type &lt;code&gt;stack build&lt;/code&gt; to generate the binary ...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ stack build
haskell-kitchen&amp;gt; configure (lib + exe)
Configuring haskell-kitchen-0.1.0.0...
haskell-kitchen&amp;gt; build (lib + exe)
Preprocessing library for haskell-kitchen-0.1.0.0..
Building library for haskell-kitchen-0.1.0.0..
Preprocessing executable 'haskell-kitchen-exe' for haskell-kitchen-0.1.0.0..
Building executable 'haskell-kitchen-exe' for haskell-kitchen-0.1.0.0..
[1 of 2] Compiling Main [Lib changed]
Linking .../haskell-kitchen-exe ...
haskell-kitchen&amp;gt; copy/register
Registering library for haskell-kitchen-0.1.0.0..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The binaries for the local platform (in my case x86_64-osx) are generated somewhere in the folder &lt;code&gt;.stack-work&lt;/code&gt; under the project folder.&lt;/p&gt;

&lt;p&gt;The command &lt;code&gt;stack exec&lt;/code&gt; helps us to find the executable and run it ...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ stack exec haskell-kitchen-exe
Welcome to Haskell Kitchen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output shows up as expected. In the near future we can build a fully functional application.&lt;/p&gt;

&lt;p&gt;For now, let's test some code by writing a new module in &lt;code&gt;src/Calculator.hs&lt;/code&gt; with simple recursive functions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="kr"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;Calculator&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nf"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;product&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kr"&gt;where&lt;/span&gt;

&lt;span class="n"&gt;double&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;Num&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="n"&gt;double&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="n"&gt;sum_list&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;Num&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="n"&gt;sum_list&lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;sum_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sum_list&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;

&lt;span class="n"&gt;product_list&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;Num&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="n"&gt;product_list&lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;product_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;product_list&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;

&lt;span class="n"&gt;factorial&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;Integral&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="n"&gt;factorial&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;product_list&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can try the functions using the GHCi environment. Execute &lt;code&gt;stack ghci&lt;/code&gt; in the project folder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ stack ghci
Using main module: 1. Package `haskell-kitchen' component haskell-kitchen:exe:haskell-kitchen-exe with main-is file: ./haskell-kitchen/app/Main.hs
haskell-kitchen&amp;gt; configure (lib + exe)
Configuring haskell-kitchen-0.1.0.0...
haskell-kitchen&amp;gt; initial-build-steps (lib + exe)
Configuring GHCi with the following packages: haskell-kitchen
GHCi, version 8.8.3: https://www.haskell.org/ghc/  :? for help
[1 of 3] Compiling Calculator       ( ./haskell-kitchen/src/Calculator.hs, interpreted )
[2 of 3] Compiling Lib              ( ./haskell-kitchen/src/Lib.hs, interpreted )
[3 of 3] Compiling Main             ( ./haskell-kitchen/app/Main.hs, interpreted )
Ok, three modules loaded.
*Main Calculator Lib&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have three modules compiled and loaded, and we are able to invoke the functions from the &lt;code&gt;Calculator&lt;/code&gt; module.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Prelude&lt;/code&gt; module in the prompt gave way to a list of our loaded modules &lt;code&gt;Main Calculator Lib&lt;/code&gt;. We can configure the prompt using &lt;code&gt;:set prompt "&amp;gt; "&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*Main Calculator Lib&amp;gt; :set prompt "&amp;gt; "
&amp;gt; Calculator.double 3
6
&amp;gt; Calculator.sum [1..5]
15
&amp;gt; Calculator.product [2, 3]
6
&amp;gt; Calculator.factorial 5
120
&amp;gt; :quit
Leaving GHCi.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It seems our functions are working well, but we can make the project better with unit tests. I'll have to research the options before writing the second article in this series.&lt;/p&gt;

</description>
      <category>haskell</category>
    </item>
  </channel>
</rss>
