<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: hannahyan</title>
    <description>The latest articles on Forem by hannahyan (@hannahyan).</description>
    <link>https://forem.com/hannahyan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F344919%2F89fcd887-3210-4296-bf79-25c58dbf33ab.jpg</url>
      <title>Forem: hannahyan</title>
      <link>https://forem.com/hannahyan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hannahyan"/>
    <language>en</language>
    <item>
      <title>Brain-inspired agentic memory</title>
      <dc:creator>hannahyan</dc:creator>
      <pubDate>Sun, 01 Jun 2025 06:58:13 +0000</pubDate>
      <link>https://forem.com/hannahyan/brain-inspired-agentic-memory-4765</link>
      <guid>https://forem.com/hannahyan/brain-inspired-agentic-memory-4765</guid>
      <description>&lt;p&gt;Memory brings stateful continuity to agentic systems. Unlike an infinitely long context window, it evolves and is portable across sessions. It could be short-term or long-term, storing user preferences and abstracted knowledge. But what kind of architectures enables LLM to retain and prioritize information across multiple interactions and sources?&lt;/p&gt;

&lt;p&gt;In Berkeley's Advanced LLM Agents open course, a neurobiology-inspired long-term memory system, HippoRAG, is introduced. It includes two main components: memory encoding and memory retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offline Indexing (Memory Encoding)&lt;/strong&gt;&lt;br&gt;
An instruction-tuned LLM processes a corpus of text and extracts a schemaless knowledge graph (KG) from each passage using Open Information Extraction, treating noun phrases as nodes and their relationships as edges. This KG acts as an artificial hippocampal memory index. Retrieval encoders then enrich the KG by adding edges between semantically similar noun phrases, simulating pattern completion. The result is a unified graph that integrates knowledge across the entire corpus, structurally mirroring the brain's memory formation process.&lt;/p&gt;
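&lt;p&gt;A minimal sketch of this indexing step, with a few hand-written triples standing in for LLM-based Open Information Extraction (the entity names and the &lt;code&gt;build_kg&lt;/code&gt; helper are illustrative, not taken from the paper):&lt;/p&gt;

```python
from collections import defaultdict

# Toy stand-in for LLM-based Open Information Extraction: in HippoRAG these
# (subject, relation, object) triples would be extracted from each passage by
# an instruction-tuned LLM. The entities here are illustrative.
triples = [
    ("stanford", "located_in", "california"),
    ("alzheimers_institute", "part_of", "stanford"),
    ("thomas", "researches", "alzheimers"),
]

# Edges a retrieval encoder might add between semantically similar noun
# phrases, simulating pattern completion.
synonym_edges = [("alzheimers", "alzheimers_disease")]

def build_kg(triples, synonym_edges):
    """Build an undirected, schemaless adjacency map over noun-phrase nodes."""
    graph = defaultdict(set)
    for subj, _relation, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    for a, b in synonym_edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

kg = build_kg(triples, synonym_edges)
```

&lt;p&gt;The resulting graph unifies knowledge across passages: nodes from different passages become reachable from one another through shared or similar noun phrases.&lt;/p&gt;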

&lt;p&gt;&lt;strong&gt;Online Retrieval (Memory Retrieval)&lt;/strong&gt;&lt;br&gt;
When a query is given, an LLM identifies key named entities and matches them to related nodes in the KG. These nodes serve as cues to initiate retrieval. HippoRAG then runs Personalized PageRank (PPR) from these nodes to explore the KG, distributing relevance across connected nodes and mimicking context-based memory recall. Node specificity adjusts the influence of each seed node, favoring more distinctive concepts. The final PPR distribution determines which passages are most relevant based on the density of high-ranking KG nodes they contain.&lt;/p&gt;
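&lt;p&gt;The retrieval step can be sketched as a plain power-iteration Personalized PageRank over a toy adjacency map (the node names and the &lt;code&gt;personalized_pagerank&lt;/code&gt; helper are illustrative; HippoRAG's actual implementation additionally weights seeds by node specificity):&lt;/p&gt;

```python
def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """Power-iteration PPR: random walks that restart at the query's seed nodes."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iters):
        # Teleport mass back to the seeds, then spread each node's rank
        # equally among its neighbors.
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for node, neighbors in graph.items():
            share = damping * rank[node] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        rank = nxt
    return rank

# Toy knowledge graph as adjacency lists over noun-phrase nodes (illustrative).
kg = {
    "stanford": ["california", "alzheimers_institute"],
    "california": ["stanford"],
    "alzheimers_institute": ["stanford", "alzheimers"],
    "alzheimers": ["alzheimers_institute", "thomas"],
    "thomas": ["alzheimers"],
}

# Named entities extracted from the query act as seed nodes; passages dense in
# high-ranking nodes would then be retrieved.
scores = personalized_pagerank(kg, seeds={"alzheimers", "stanford"})
```

&lt;p&gt;Nodes adjacent to several seeds accumulate more probability mass, which is what lets a single PPR pass surface multi-hop connections without iterative retrieval.&lt;/p&gt;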

&lt;p&gt;&lt;strong&gt;Brain functions&lt;/strong&gt;&lt;br&gt;
These components directly correspond to brain functions: the neocortex handles abstraction, the parahippocampal region manages contextual associations, and the hippocampus serves as the indexing center. By mirroring this division of labor, HippoRAG enables biologically inspired memory retrieval grounded in cognitive science.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros &amp;amp; Cons&lt;/strong&gt;&lt;br&gt;
These systems offer significant advantages: continuous learning without catastrophic forgetting, efficient handling of partial queries, and transparent retrieval processes. Together, these enable multi-hop reasoning in a single retrieval step, with lower latency and cost than iterative retrieval pipelines.&lt;/p&gt;

&lt;p&gt;However, they also present challenges: implementation complexity, resource requirements, and potential scalability issues with extremely large datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://mem0.ai/blog/memory-in-agents-what-why-and-how/" rel="noopener noreferrer"&gt;https://mem0.ai/blog/memory-in-agents-what-why-and-how/&lt;/a&gt;&lt;br&gt;
HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models&lt;/p&gt;

</description>
      <category>multiagent</category>
      <category>agents</category>
      <category>memory</category>
      <category>ai</category>
    </item>
    <item>
      <title>4 tools to boost data science reproducibility</title>
      <dc:creator>hannahyan</dc:creator>
      <pubDate>Tue, 08 Dec 2020 05:27:48 +0000</pubDate>
      <link>https://forem.com/hannahyan/4-tools-to-boost-data-science-reproducibility-4j6b</link>
      <guid>https://forem.com/hannahyan/4-tools-to-boost-data-science-reproducibility-4j6b</guid>
      <description>&lt;p&gt;In a data science project, it's typically important to have reproducibility and minimize manual work. That involves a number of things from config to parameterization. Here are a few tools to improve that workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/pypa/pipenv"&gt;Pipenv&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Before &lt;em&gt;pipenv&lt;/em&gt;, the main way to set up a project was to activate a virtual environment and put all the needed packages in a &lt;em&gt;requirements.txt&lt;/em&gt;, which could also incur manual work, since &lt;code&gt;pip freeze &amp;gt; requirements.txt&lt;/code&gt; brings in a lot of extra packages.&lt;/p&gt;

&lt;p&gt;The elegant thing about &lt;em&gt;pipenv&lt;/em&gt;, in its own words, is that it "automatically creates and manages a virtualenv for your projects, as well as adds/removes packages from your Pipfile as you install/uninstall packages". It feels like npm in the python world.&lt;/p&gt;

&lt;p&gt;So instead of the usual &lt;code&gt;pip install&lt;/code&gt;, one now uses &lt;code&gt;pipenv install&lt;/code&gt; within the virtual environment, and there's no need to manually maintain a requirements.txt file anymore.&lt;/p&gt;

&lt;p&gt;Note, however, that if you didn't specify a package version during installation, the &lt;em&gt;Pipfile&lt;/em&gt; won't carry the version either. Instead, it will show a wildcard like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[packages]
pandas = "*"
scikit-learn = "*"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: this is not to be confused with &lt;em&gt;pyenv&lt;/em&gt;, another tool for switching between different versions of Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  Makefile
&lt;/h3&gt;

&lt;p&gt;If there are multiple repeated steps in a project, such as setup, format, test, lint, and deploy, then a Makefile saves time for the project owner and makes the project friendlier for its users. &lt;/p&gt;

&lt;p&gt;A sample snippet of a Makefile, in combination with the aforementioned &lt;em&gt;pipenv&lt;/em&gt;, would look like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;setup:
    pip install pipenv
    pipenv install
    pipenv shell
format:
    black *.py
    pylint --disable=R,C sample 
all: setup format
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By running a single command, &lt;code&gt;make all&lt;/code&gt;, one can do everything in the Makefile – set up a virtual environment, install all the dependencies, auto-format the code, and lint the script to surface unused libraries/variables, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  config.yml + argparse
&lt;/h3&gt;

&lt;p&gt;Sometimes one wants to run multiple experiments in a machine learning project, or conduct the same analysis on different datasets. In this case, it is good to keep a separate configuration for each set of inputs instead of overwriting it. One can achieve that with a configs folder holding several config.yml files, each containing one set of inputs. In the main script, one can then use &lt;em&gt;argparse&lt;/em&gt; to select a config.yml, load it as a dictionary, and refer to the specific parameters in the config&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArgumentParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;__doc__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"-f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--config_file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"configs/config1.yml"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Configuration file to load."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ARGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="err"&gt;    &lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Loaded configuration file &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ARGS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this way, each of the settings is preserved rather than overwritten, and it becomes easier to track the outcome.&lt;/p&gt;
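&lt;p&gt;For reference, one of the config files consumed by such a script might look like this (the field names and values are purely illustrative):&lt;/p&gt;

```yaml
# configs/config1.yml -- one experiment's inputs
alpha: 0.6
ratio: 0.1
input_path: data/train.csv
```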

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/nteract/papermill"&gt;papermill&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;If the setup above helps boost productivity and reproducibility for Python scripts, then this library does the same for Jupyter notebooks, which are often used in the initial stages of a project due to their interactive nature. By defining all the variables/parameters in a tagged cell of the notebook and creating a runner, one can specify new parameters and generate a new notebook for each set of parameters.&lt;/p&gt;

&lt;p&gt;One can generate a new notebook like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;papermill&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;
&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute_notebook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="err"&gt;   &lt;/span&gt;&lt;span class="s"&gt;'path/to/input.ipynb'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt;   &lt;/span&gt;&lt;span class="s"&gt;'path/to/output.ipynb'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt;   &lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is handy and time-saving compared to the alternative of duplicating, altering, and renaming a large number of nearly identical notebooks that differ only in a few variables.&lt;/p&gt;

&lt;p&gt;Overall I find these tools useful in maintaining dependencies, tracking inputs, and reducing repetition. &lt;/p&gt;

&lt;p&gt;Image credit: &lt;a href="https://unsplash.com/photos/dqXiw7nCb9Q"&gt;unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tooling</category>
      <category>python</category>
    </item>
    <item>
      <title>10 tips for current data science students</title>
      <dc:creator>hannahyan</dc:creator>
      <pubDate>Fri, 15 May 2020 20:56:57 +0000</pubDate>
      <link>https://forem.com/hannahyan/10-pieces-of-advice-for-current-data-science-students-4i7h</link>
      <guid>https://forem.com/hannahyan/10-pieces-of-advice-for-current-data-science-students-4i7h</guid>
      <description>&lt;p&gt;Graduating during the time of quarantine has been a unique experience. As I reflect on the past years, I'd like to share tips and advice for people interested in data science. I hope those are valuable regardless of whether you are pursuing an advanced degree or MOOC in applied data science. &lt;/p&gt;

&lt;h1&gt;
  
  
  1. Prioritize your learning plan
&lt;/h1&gt;

&lt;p&gt;Data science is an evolving field that roughly subdivides into 3 subfields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analytics (expect to be well-versed in business lingo and non-technical communication)&lt;/li&gt;
&lt;li&gt;Inference (expect to understand experimentation, causality and statistical modeling)&lt;/li&gt;
&lt;li&gt;Machine learning (expect to know ML/DL algorithms at a deep level. Understanding cloud technologies can be helpful too. You might need to specialize in a field such as computer vision or NLP)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes a role might entail a combination of these. Other related fields include data engineering, software engineering, and data visualization.&lt;/p&gt;

&lt;p&gt;Knowing which track you are most interested in can help. Personalized education is not yet a reality, so you might want to DIY a curriculum. Many programs offer independent studies, where you can decide on a project and a lecturer will provide guidance. Another setting I liked is the inverted classroom, where students learn from each other and the lecturer plays a facilitator role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unique tips&lt;/strong&gt;&lt;br&gt;
Getting into a lab or collaborating with a research team can help you develop domain knowledge even if your goal is not research. By default, you are most likely to focus on applied/practical problems, so consider factoring in some time for research as well.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Balance exploitation vs. exploration
&lt;/h1&gt;

&lt;p&gt;There will be no shortage of learning, but there will be a shortage of time. Given limited energy, you can balance these two to achieve both breadth and depth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exploitation (do only what you are good at) &lt;/li&gt;
&lt;li&gt;exploration (try a bit of everything else) &lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  3. Sharing and communicating
&lt;/h1&gt;

&lt;p&gt;Sharing either concepts or projects is helpful in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;helps you fortify learning and identify weaknesses.&lt;/li&gt;
&lt;li&gt;your learning journey, thought process, trials and tribulations are helpful to others. You do not have to be the world's top expert before you can share learning or provide advice. Your experience can be valuable to those who have yet to gain it, and vice versa.&lt;/li&gt;
&lt;li&gt;at work you will notice that verbal, visual, and written communication skills matter as much, if not more, than your Python/R skills. And it is important to communicate the impact in a concise way. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Unique tips&lt;/strong&gt;&lt;br&gt;
If possible, explain concepts as simply as you can. There is no shortage of technical papers, but few can explain complex concepts as if to a 4-year-old. This is probably why YouTube channels like &lt;a href="https://www.youtube.com/c/3blue1brown"&gt;3blue1brown&lt;/a&gt;, &lt;a href="https://www.youtube.com/channel/UCbfYPyITQ-7l4upoX8nvctg"&gt;2 Minute Papers&lt;/a&gt; and &lt;a href="https://www.youtube.com/playlist?list=PL1sJQ3lOys1jXxVgoaRSVJ77twbbRKJX5"&gt;StatQuest&lt;/a&gt; are popular. Unless your goal is academic publishing, you might want to make your writing more accessible.&lt;/p&gt;

&lt;h1&gt;
  
  
  4. Everyone can have a portfolio
&lt;/h1&gt;

&lt;p&gt;Neither design nor web dev skills should be a hurdle to starting a portfolio that is uniquely you. I made my first &lt;a href="https://medium.com/@yanhann10/building-a-data-science-portfolio-on-github-io-6abdc8fb0fc7"&gt;portfolio&lt;/a&gt; years ago, when I barely knew CSS; it later &lt;a href="http://www.hannahyan.com"&gt;evolved&lt;/a&gt; a bit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended tools&lt;/strong&gt;&lt;br&gt;
After I learned more about design and web, I made this table of recommended tools for building a portfolio depending on whether you have design/web dev experience or not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HQFFBqgr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/r78v4k7ww2mwiw99p9kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HQFFBqgr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/r78v4k7ww2mwiw99p9kw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  5. Two ways of approaching challenges
&lt;/h1&gt;

&lt;p&gt;Before I chose my first elective, my advisor asked me whether I preferred to ease into or ease out of the workload. Two ways of facing challenges and getting unstuck are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decrescendos: if you habitually solve hard challenges, some day you will find everything easy. This is applicable if you are moving from hard sciences to other fields.&lt;/li&gt;
&lt;li&gt;Crescendos: if the topic at hand is too challenging, find a simpler explanation first. After establishing some foundation, tune up the level. This is applicable in more scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  6. Plan ahead on stress relief
&lt;/h1&gt;

&lt;p&gt;Learning is a lifelong journey, but life is more than learning. So it is important to find a balance, and to seek help early when you are stressed. You can plan out different pathways to peace in advance, and deploy them when stress levels go through the roof, as they will.&lt;/p&gt;

&lt;h1&gt;
  
  
  7. Get involved in communities and conferences
&lt;/h1&gt;

&lt;p&gt;If you have a local or online community, there are ways to help organize or get involved. If not, you can find 2-3 like-minded friends to explore topics of interest and give each other feedback. &lt;/p&gt;

&lt;p&gt;Going to conferences allows you to immerse yourself in different topics and talk to people with different perspectives. Many conferences offer student volunteering opportunities, which come with a free ticket. &lt;/p&gt;

&lt;p&gt;Each conference has its own persona. Last year a lecturer recommended that I attend the &lt;a href="http://eyeofestival.com"&gt;Eyeo festival&lt;/a&gt; for data visualization, which has a great sense of community. I then went to the Strata Data Conference in NYC, which has a more commercial feel. This year I volunteered at the data science conference &lt;a href="https://odsc.com"&gt;ODSC&lt;/a&gt; East, where most speakers are PhDs working in industry, and quite a few are involved in open source and open tech. I've seen other, more academically minded graduate students presenting at conferences like ICML.&lt;/p&gt;

&lt;p&gt;For introverts, online conferences might be easier to handle, and organizers are making efforts to keep the online experience immersive. &lt;/p&gt;

&lt;h1&gt;
  
  
  8. Start early
&lt;/h1&gt;

&lt;p&gt;There is a natural tendency to procrastinate, and strategic procrastination might even be conducive to creativity. But the field of data science moves very fast, and there is an influx of people from other fields. There might be PR and economic reasons for this, but I also choose to think it is partly because there is real value in leveraging data to solve organizational problems, commercial or social. Taking stock of the competition, you may want to start early in whatever you are doing.&lt;/p&gt;

&lt;h1&gt;
  
  
  9. Own it
&lt;/h1&gt;

&lt;p&gt;In grad school, study is largely self-directed. As such, you need to be disciplined in owning your progress, ideas, and unique contributions. This might be particularly relevant in the post-pandemic virtual new normal.&lt;br&gt;
&lt;strong&gt;Recommended tools&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://clockify.me/apps"&gt;Clockify - a browser plug-in for tracking time&lt;/a&gt;&lt;br&gt;
Evernote for note taking&lt;/p&gt;

&lt;h1&gt;
  
  
  10. Think beyond technicalities
&lt;/h1&gt;

&lt;p&gt;There are often humans behind the numbers and algorithms. It is good to be aware of the biases, privacy concerns, and consequences. &lt;br&gt;
&lt;strong&gt;Some of the recommended books&lt;/strong&gt; (I've yet to read):&lt;br&gt;
Weapons of Math Destruction&lt;br&gt;
Invisible Women: Data Bias in a World Designed for Men.&lt;/p&gt;

&lt;p&gt;Cover image credit: Photo by Matt Howard on Unsplash&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>learning</category>
      <category>career</category>
    </item>
    <item>
      <title>I recently finished Udacity Machine Learning Engineer nanodegree and here’s my experience</title>
      <dc:creator>hannahyan</dc:creator>
      <pubDate>Sun, 03 May 2020 19:13:26 +0000</pubDate>
      <link>https://forem.com/hannahyan/i-recently-finished-udacity-machine-learning-engineer-nanodegree-and-here-s-my-experience-jpk</link>
      <guid>https://forem.com/hannahyan/i-recently-finished-udacity-machine-learning-engineer-nanodegree-and-here-s-my-experience-jpk</guid>
      <description>&lt;p&gt;I want to know more about how to deploy ML algorithms in a web app. And that prompted me to enroll in this course.&lt;/p&gt;

&lt;p&gt;The course assumes a certain prior foundation and focuses on how to use AWS SageMaker.&lt;/p&gt;

&lt;h1&gt;
  
  
  An ultra-compressed guide to SageMaker
&lt;/h1&gt;

&lt;p&gt;Sagemaker is a development platform for ML practitioners.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ease of deployment: it enables real-time inference and batch transformation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Variety: it offers both high- and low-level APIs, as well as pre-built models on the Marketplace. SageMaker Autopilot offers automatic build, train, and tune functionality for tabular data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Auto-scaling: The model runs on container clusters, enabling it to deliver high availability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The environment is akin to a Jupyter notebook with a somewhat different workflow and syntax. Data are often read in from S3 after serialization. One can also optimize models by targeting a certain recall or specifying the class-imbalance treatment. &lt;/p&gt;

&lt;p&gt;One can either load a pre-built model or create a training object called an Estimator. After training, one can deploy the model endpoint as a predictor and run inference against it. It's also important to delete the endpoint afterwards, since endpoints are billed for the time they are in use.&lt;/p&gt;

&lt;h1&gt;
  
  
  Topics and Projects
&lt;/h1&gt;

&lt;p&gt;The syllabus is very well structured. Here are the main topics covered and the related project built.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How data scientists can leverage DevOps principles. It has a section covering how to make a Python package – including testing, logging, and uploading to PyPI. This came in handy later when I built a &lt;a href="https://github.com/yanhann10/afprop"&gt;clustering package&lt;/a&gt; with a statistician teammate. The package has the advantage over KMeans of not having to pre-specify the number of clusters.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to deploy models as API endpoints. The corresponding project I completed is a &lt;a href="https://github.com/yanhann10/sagemaker_plagiarism_detector"&gt;plagiarism detector&lt;/a&gt;. It was trained and deployed on SageMaker, and invoked via Lambda and API Gateway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to plan a project end-to-end. The program contains a capstone with a few recommended options and a customized option. One needs to write a project proposal before proceeding, then submit the project with a report. I found the report rubric detailed and insightful. The corresponding project I did is a &lt;a href="https://github.com/yanhann10/deep_learning_use_cases/tree/master/which_dog_breed"&gt;dog breed classifier&lt;/a&gt;, with a dataset containing 133 types of dogs. It could be a good starting point for other image-related applications. Taking a spin on the original Kaggle competition, if supplied an image of a human face as detected by OpenCV, it will also identify the resembling dog breed. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  What’s unique
&lt;/h1&gt;

&lt;p&gt;Overall I had a positive experience. After submitting a project, one receives quick and detailed feedback. There are forums where one can discuss with TAs or help each other out.&lt;/p&gt;

&lt;p&gt;What stood out is the rare offering of GitHub and LinkedIn reviews. I received some helpful feedback on GitHub usage, including suggestions to write commit messages in the style of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;feat: a new feature&lt;br&gt;
fix: a bug fix&lt;br&gt;
docs: changes to documentation&lt;br&gt;
style: formatting, missing semicolons, etc; no code change&lt;br&gt;
refactor: refactoring production code&lt;br&gt;
test: adding tests, refactoring test; no production code change&lt;br&gt;
chore: updating build tasks, package manager configs, etc; no production code change&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While I like the style, I think it can be further expanded to the data science context. Common code chunks such as feature selection or model interpretation don't fit into either feat or chore.&lt;/p&gt;

&lt;h1&gt;
  
  
  How does it compare with graduate-level courses
&lt;/h1&gt;

&lt;p&gt;What’s unique about Udacity is that it is very pertinent to the industry and the materials are continuously updated. The videos are clearly explained and highly digestible, with extra readings and resources, and they involve significantly less mathematical derivation. All the projects are hands-on. &lt;/p&gt;

&lt;p&gt;University courses on machine learning usually don't cover deployment, so this is more like a mixture of a machine learning course and an AWS cloud computing course, with a small dose of software development.&lt;/p&gt;

&lt;p&gt;Another difference is that Udacity programs typically have a product tie-in. The MLE program, for example, is tied to commercial software from AWS. In a classroom setting, on the other hand, courses tend to be more vendor-agnostic and could cover AWS, GCP, and Azure across the board.&lt;/p&gt;

&lt;p&gt;These products, after all, are only means to an end. Tools change and techniques evolve, but thought processes and knowledge stay. It is more important to understand the field on both theoretical and practical grounds than to simply master the tools. So the tie-in is not necessarily a downside, as long as one can transfer the learning to other cloud platforms.&lt;/p&gt;

&lt;h1&gt;
  
  
  Who is it suitable for
&lt;/h1&gt;

&lt;p&gt;The program is for someone who has already had a first course in ML, has intermediate Python skills, and wants to learn deployment/production. I did it during the free month and appreciate that they offer such a program.&lt;/p&gt;

&lt;h1&gt;
  
  
  What's next
&lt;/h1&gt;

&lt;p&gt;I found it helpful for understanding the entire ML workflow, which may expand the kinds of personal projects I can do.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>aws</category>
      <category>github</category>
    </item>
    <item>
      <title>Getting started in building &amp; deploying interactive data science apps with Streamlit - Part 2</title>
      <dc:creator>hannahyan</dc:creator>
      <pubDate>Tue, 10 Mar 2020 04:40:43 +0000</pubDate>
      <link>https://forem.com/hannahyan/getting-started-in-deploying-interactive-data-science-apps-with-streamlit-part-2-3ob</link>
      <guid>https://forem.com/hannahyan/getting-started-in-deploying-interactive-data-science-apps-with-streamlit-part-2-3ob</guid>
      <description>&lt;p&gt;After building a data science app, we want to make it accessible instead of gathering dust in a code dump.&lt;/p&gt;

&lt;p&gt;In the previous &lt;a href="https://dev.to/hannahyan/getting-started-in-building-and-deploying-interactive-data-science-apps-with-streamlit-6ab"&gt;post&lt;/a&gt;, we used Streamlit to quickly spin up a slick-looking app with a straightforward Python script and not a single line of React.js. Next, we will go over several methods of deployment: hosting it on Heroku, containerizing it with Docker, plus running it directly from GitHub as an easy way to share.&lt;/p&gt;

&lt;h1&gt;
  
  
  🍳Prerequisite
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;A Github/Gitlab account&lt;/li&gt;
&lt;li&gt;The files in this &lt;a href="https://gist.github.com/yanhann10/e0cde1215722732b82fa195b19511394"&gt;gist&lt;/a&gt; covered from the previous post&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  ☁️Deployment
&lt;/h1&gt;

&lt;p&gt;Among the several deployment options, I find Heroku one of the most beginner-friendly. Before we start, we need to register a Heroku account, install its CLI, and create an app with a preferred name in its UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hosting on Heroku
&lt;/h2&gt;

&lt;p&gt;Two additional files from the Gist are needed in the project folder.&lt;/p&gt;

&lt;p&gt;Procfile&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web: sh setup.sh &amp;amp;&amp;amp; streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;setup.sh&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.streamlit/

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
[general]&lt;/span&gt;&lt;span class="se"&gt;\n\&lt;/span&gt;&lt;span class="s2"&gt;
email = &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;your-email@domain.com&lt;/span&gt;&lt;span class="se"&gt;\"\n\&lt;/span&gt;&lt;span class="s2"&gt;
"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ~/.streamlit/credentials.toml

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
[server]&lt;/span&gt;&lt;span class="se"&gt;\n\&lt;/span&gt;&lt;span class="s2"&gt;
headless = true&lt;/span&gt;&lt;span class="se"&gt;\n\&lt;/span&gt;&lt;span class="s2"&gt;
enableCORS=false&lt;/span&gt;&lt;span class="se"&gt;\n\&lt;/span&gt;&lt;span class="s2"&gt;
port = &lt;/span&gt;&lt;span class="nv"&gt;$PORT&lt;/span&gt;&lt;span class="se"&gt;\n\&lt;/span&gt;&lt;span class="s2"&gt;
"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ~/.streamlit/config.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thereafter in your local terminal, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are working from an existing local repo, you can skip the git init step.&lt;br&gt;
Then follow with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;heroku git:remote -a &amp;lt;heroku_project_name&amp;gt;
git add .
git commit -am "&amp;lt;commit message&amp;gt;"
git push heroku master
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now, instead of the app living only in your local browser, you have a live URL to share, of the form &amp;lt;app-name&amp;gt;.herokuapp.com.&lt;/p&gt;

&lt;h3&gt;
  
  
  💡 Tips and Tricks
&lt;/h3&gt;

&lt;p&gt;Like most cloud platforms, Heroku offers auto-deploy through GitHub integration, under the Deploy tab. It's good to test the code beforehand.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eSsNZ-st--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/s1g8db1l57bbxpkw2dtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eSsNZ-st--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/s1g8db1l57bbxpkw2dtx.png" alt="auto screenshot"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploy with Docker via GCP or AWS cloud services
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Why Docker
&lt;/h3&gt;

&lt;p&gt;While you could have other users git clone your repo and run it locally, they might run into all sorts of issues: a package that fails to install, a wrong version that throws everything off, or a different operating system. None of these are things you want to see after all that work. Having the app easily accessible &amp;amp; reproducible also makes it easier to conduct user testing and gather feedback.&lt;/p&gt;
&lt;h3&gt;
  
  
  Two main concepts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Container: a running instance that encapsulates an application together with the packages and libraries it requires to run&lt;/li&gt;
&lt;li&gt;Image: a read-only snapshot from which containers are launched&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Getting ready
&lt;/h3&gt;

&lt;p&gt;To begin with, we can register a DockerHub account and create a repo there. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T--KK70J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lrjnhmvqmx8thb2erou8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T--KK70J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lrjnhmvqmx8thb2erou8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two options, public and private, work much like GitHub's. You can grant specific people access to a private repo. For now, we will create a public one for demo purposes.&lt;/p&gt;

&lt;p&gt;Next you will need a &lt;a href="https://docs.docker.com/engine/reference/builder/"&gt;Dockerfile&lt;/a&gt;, such as this &lt;a href="https://gist.github.com/yanhann10/e0cde1215722732b82fa195b19511394#file-dockerfile"&gt;one&lt;/a&gt;. &lt;/p&gt;
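
&lt;p&gt;For orientation, a minimal Dockerfile for a Streamlit app looks roughly like the sketch below; the base image version and file names (&lt;code&gt;requirements.txt&lt;/code&gt;, &lt;code&gt;app.py&lt;/code&gt;) are assumptions carried over from the earlier setup, so defer to the gist for the working version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;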

&lt;p&gt;Now we are ready to Dockerize the app, and there are just a few key commands to know. Since Docker/Kubernetes (the system that orchestrates many containers) can get complex, it is better to start from the bare essentials.&lt;/p&gt;

&lt;p&gt;I prefer to work from the GCP cloud console or AWS Cloud9 so I don't need to install Docker, but you can work from a local terminal if you already have Docker installed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Main commands
&lt;/h3&gt;

&lt;p&gt;In the cloud console, choose a project, then clone your repo if the code is on GitHub, or upload the project folder.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log in
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It will prompt for Docker Hub username and password.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build --tag=&amp;lt;DockerHub Account Name&amp;gt;/&amp;lt;DockerHub Repo Name&amp;gt;:0.0 .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You could specify a version number or use &lt;code&gt;latest&lt;/code&gt;. &lt;/p&gt;
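
&lt;p&gt;As a hypothetical example, an existing image can also be retagged before pushing, so the same build is available under both a version number and &lt;code&gt;latest&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker tag &amp;lt;DockerHub Account Name&amp;gt;/&amp;lt;DockerHub Repo Name&amp;gt;:0.0 &amp;lt;DockerHub Account Name&amp;gt;/&amp;lt;DockerHub Repo Name&amp;gt;:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;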

&lt;ul&gt;
&lt;li&gt;Check
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker images
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You should be able to find the image you just built under REPOSITORY.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker image push &amp;lt;DockerHub Account Name&amp;gt;/&amp;lt;DockerHub Repo Name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next you will see the latest version reflected in Docker Hub. Clicking the public view on the right gives a URL you can share, and others will be able to pull the image.&lt;/p&gt;
&lt;h3&gt;
  
  
  💡 Tips and Tricks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prefer a dash over an underscore in Docker tags&lt;/li&gt;
&lt;li&gt;If you want to keep your Dockerfile clean and tidy, you can use Hadolint, a Dockerfile linter, and add a line such as &lt;code&gt;hadolint streamlit_demo/Dockerfile&lt;/code&gt; to the Makefile&lt;/li&gt;
&lt;li&gt;To save the hassle of having to push to Docker Hub after every update, you can click on Builds &amp;gt; Configure Automated Builds and connect to your Github.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fln4dMD8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ov6xpjnw7fogcuxzsuiz.png" alt="github connect"&gt;
&lt;/li&gt;
&lt;li&gt;Instead of running several commands for each project, you can save them in a bash file called run.sh and then run &lt;code&gt;sh run.sh&lt;/code&gt; to execute all the commands.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;dockerpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"change this to &amp;lt;DockerHub Account Name&amp;gt;/&amp;lt;DockerHub Repo Name&amp;gt;"&lt;/span&gt;
&lt;span class="c"&gt;#log in&lt;/span&gt;
docker login
&lt;span class="c"&gt;#build&lt;/span&gt;
docker build &lt;span class="nt"&gt;--tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$dockerpath&lt;/span&gt;:latest &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;#push&lt;/span&gt;
docker image push &lt;span class="nv"&gt;$dockerpath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Running the app directly with Github url
&lt;/h2&gt;

&lt;p&gt;The third method is to share the GitHub URL of your app.py, provided app.py is the only file you need to run the app and you have no other dependencies such as data files or helper scripts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamlit run &amp;lt;app.py github url&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And see the app in action. However, this method doesn't bring along the CSV files in your folder, so the previous two methods are still the most solid ways to share your apps.&lt;/p&gt;

&lt;p&gt;Yet another option on the horizon: Streamlit plans to offer its own deployment solution. When that happens, it will likely be the most convenient route for Streamlit apps. But the deployment methods we covered here apply to all kinds of apps and scripts beyond Streamlit (some tweaking might be needed in the config files).&lt;/p&gt;

&lt;h2&gt;
  
  
  ⌛Wrap-up
&lt;/h2&gt;

&lt;p&gt;Today we reviewed how to use Docker to containerize a Streamlit app and how to deploy it to Heroku. You could also deploy it to GCP/AWS/Azure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional useful resources&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/streamlit/streamlit/wiki/FAQ"&gt;Streamlit wiki&lt;/a&gt; - links to articles on GCP/AWS/Azure deployment&lt;br&gt;
&lt;a href="https://github.com/dylanlrrb/Please-Contain-Yourself"&gt;Please contain yourself&lt;/a&gt; - Clear &amp;amp; thorough explanation of how Docker works in a series.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://unsplash.com/photos/AM53LSIBnRo"&gt;Cover photo credit&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>beginners</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Getting started in building and deploying interactive data science apps with Streamlit - Part 1</title>
      <dc:creator>hannahyan</dc:creator>
      <pubDate>Sun, 08 Mar 2020 04:59:47 +0000</pubDate>
      <link>https://forem.com/hannahyan/getting-started-in-building-and-deploying-interactive-data-science-apps-with-streamlit-6ab</link>
      <guid>https://forem.com/hannahyan/getting-started-in-building-and-deploying-interactive-data-science-apps-with-streamlit-6ab</guid>
      <description>&lt;p&gt;Flask used to come to mind when data scientists wanted to spin up a Python-based data science app, but there is a better option now. To create an interactive facade for a machine learning or visualization script, Streamlit is far faster, since it removes the need to write any front-end code. &lt;/p&gt;

&lt;p&gt;Now we'll go step by step through building a Streamlit app. I will also review some pros and cons of Streamlit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who is this for
&lt;/h2&gt;

&lt;p&gt;Anyone who wants to put an interactive user interface or visible facade on their Python scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisite
&lt;/h2&gt;

&lt;p&gt;Python knowledge&lt;/p&gt;

&lt;h2&gt;
  
  
  What can it do
&lt;/h2&gt;

&lt;p&gt;Streamlit can be used to build machine learning/AI apps, display exploratory or analytical data visualizations, or both at the same time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkynviiqxqx8n5a2bvg6u.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkynviiqxqx8n5a2bvg6u.gif" alt="YOLO gif"&gt;&lt;/a&gt;&lt;br&gt;
Image credit: Streamlit&lt;/p&gt;
&lt;h2&gt;
  
  
  ⛵ Getting started
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Hello world
&lt;/h3&gt;

&lt;p&gt;To get started, first install the library&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;streamlit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create a folder called streamlit_demo (or your preferred name) and add an app.py file to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamlit&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;
&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the terminal, cd into the folder and type&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;streamlit&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see these in the terminal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F27p4ldw8gn04614azrcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F27p4ldw8gn04614azrcj.png" alt="code screenshot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now a browser window will automatically open, and you have your first app up and running🎉&lt;/p&gt;

&lt;p&gt;Next we can start to add UI components, which are treated as variables in app.py. Add this snippet to see what that means.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;checkbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;show&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My first app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common UI components include the radio button, checkbox, selectbox, slider, and text input; the full list can be found in the &lt;a href="https://docs.streamlit.io/api.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that we have a basic idea of how this works, we can start reading data and make a simple plot. The data and code for the following sections can be found in this &lt;a href="https://gist.github.com/yanhann10/e0cde1215722732b82fa195b19511394" rel="noopener noreferrer"&gt;gist&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamlit&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;


&lt;span class="c1"&gt;# header
&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London Bikes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subheader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Numuber of Trips starting from Hyde Parker Corner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# read data - source: London Bicycle Hires from Greater London Authority on Google Datasets via Bigquery
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;london_bikes.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# area plot
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;
&lt;span class="n"&gt;df_trips_by_day&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;area_chart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_trips_by_day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using Streamlit's built-in charting function, we made our first plot in a few lines of code, and it's interactive and downloadable!&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnl63ozhglc1w39ykggq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnl63ozhglc1w39ykggq1.png" alt="Number of bike trips by day"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going one step further, one can visualize the destinations of all the trips starting from Hyde Park Corner.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp114vvxg7np3a87pp9ac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp114vvxg7np3a87pp9ac.png" alt="arc plot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  💡 Tips and Tricks
&lt;/h3&gt;

&lt;p&gt;As one adds libraries and functions to the app, it becomes more important to organize the dependencies in one place. A nifty trick to make the workflow easier is to create a Makefile and a requirements.txt.&lt;/p&gt;

&lt;p&gt;requirements.txt lists the packages needed for the app to run&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw0q07gbegm5rxklt37iz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw0q07gbegm5rxklt37iz.png" alt="pkgs screenshot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Makefile offers a recipe for properly setting up the app. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fg8bgdawa00b2hd3nnzol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fg8bgdawa00b2hd3nnzol.png" alt="code screenshot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, when you run &lt;code&gt;make install&lt;/code&gt;, it will automatically install all the dependencies.&lt;/p&gt;
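
&lt;p&gt;A minimal Makefile along these lines might look as follows; the target names are illustrative, and note that each recipe line must be indented with a tab:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;install:
	pip install -r requirements.txt

lint:
	hadolint Dockerfile

run:
	streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;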

&lt;p&gt;Now that we had a glimpse of the functionalities of Streamlit, we can weigh it against other frameworks.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚖️Pros and Cons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Streamlit has a long list of pros which I love:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Accessible app making for everyone (who uses Python)&lt;/em&gt;. This is the main draw, since it can save time by allowing one to focus on the data science aspect, and also suits those who may not want to learn HTML/CSS. The learning curve is fairly flat.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Covers most of the common UIs needed in a data app. Plus, they look good!&lt;/em&gt; It offers a slider, checkbox, radio buttons, a collapsible sidebar, a progress bar, file upload, etc. Overall these functionalities and the ease of use are impressive. It would be even nicer if certain components had an information button to offer further explanation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Supports multiple interactive visualization libraries&lt;/em&gt;. It supports libraries such as &lt;a href="https://plot.ly/" rel="noopener noreferrer"&gt;plotly&lt;/a&gt;, &lt;a href="https://altair-viz.github.io/" rel="noopener noreferrer"&gt;altair&lt;/a&gt;, &lt;a href="https://docs.bokeh.org/en/1.0.0/" rel="noopener noreferrer"&gt;bokeh&lt;/a&gt;, &lt;a href="https://vega.github.io/vega-lite/" rel="noopener noreferrer"&gt;Vega-Lite&lt;/a&gt;, and &lt;a href="https://medium.com/vis-gl/pydeck-unlocking-deck-gl-for-use-in-python-ce891532f986" rel="noopener noreferrer"&gt;pydeck&lt;/a&gt;. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw1tfc7nyiuz9zy6s7dlj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw1tfc7nyiuz9zy6s7dlj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
And here are some other libraries it's compatible with.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbzcppr06gim2um24o8en.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbzcppr06gim2um24o8en.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And since it's Python-based, one can run a machine learning algorithm in the same app and then plot charts of the algorithm's output, such as cluster or classification labels.&lt;/p&gt;

&lt;h3&gt;
  
  
  💡 Tips and Tricks
&lt;/h3&gt;

&lt;p&gt;To view more UI chart components in action, this &lt;a href="https://fullstackstation.com/streamlit-components-demo" rel="noopener noreferrer"&gt;demo app&lt;/a&gt; is a good place to start where you can view both the chart and code side-by-side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some cons:
&lt;/h3&gt;

&lt;p&gt;I shall start off by saying none of these cons is prohibitive. They are more of a wish list, noted here to keep a balanced evaluation of the tool.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Convenience vs flexibility&lt;/em&gt;. This is not specific to Streamlit: the higher-level a framework is, the less customization it tends to provide. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Streamlit's case, one cannot currently customize the layout (though it's on their roadmap), which makes it not the most suitable choice for complex dashboards that require a container-like layout. The Streamlit layout mainly consists of a big vertical panel and a small side panel. One cannot really position two different charts side by side, or add custom elements at arbitrary locations. If you want to customize everything, you might have to use React.js/Vue.js.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Size of data input&lt;/em&gt;. Streamlit has a soft limit of 50MB for data upload. It will be interesting to see how the framework enables large-scale applications going forward.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Limited support for video/animation&lt;/em&gt;. No native support for video formats and no play button. This can limit certain use cases involving video analysis and animation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Last but not least, one needs to be aware that Streamlit is focused on DS/ML use cases, and thus will not offer the full suite of functionality that Flask/Django have. This is not necessarily a con; having a focus can be a good thing. It simply means that if one wants to build apps with other functionality, such as user authentication, newsletter subscription, or user-to-user interaction, it's better to look elsewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⌛ Wrap-up
&lt;/h2&gt;

&lt;p&gt;Streamlit is simple enough that everyone can use it. It can be a superb option if you need a quick solution for an interactive data app, especially one with both a machine learning algorithm running in the background and interactive plots. Hopefully today we had some fun building data science apps. In a future &lt;a href="https://dev.to/hannahyan/getting-started-in-deploying-interactive-data-science-apps-with-streamlit-part-2-3ob"&gt;post&lt;/a&gt;, we'll cover how to deploy it to the cloud and containerize it using Docker.&lt;/p&gt;

&lt;p&gt;For more information, these resources might be useful. &lt;br&gt;
&lt;a href="https://github.com/marcskovmadsen/awesome-streamlit" rel="noopener noreferrer"&gt;awesome-streamlit&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.streamlit.io/gallery" rel="noopener noreferrer"&gt;streamlit gallery&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
