<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Inzamam ul Haque</title>
    <description>The latest articles on Forem by Inzamam ul Haque (@iaminziee).</description>
    <link>https://forem.com/iaminziee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F89705%2F8be3a373-b077-4f7d-97d7-b45fc0a7f81b.png</url>
      <title>Forem: Inzamam ul Haque</title>
      <link>https://forem.com/iaminziee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iaminziee"/>
    <language>en</language>
    <item>
      <title>Hbase 'No protocol version header error'</title>
      <dc:creator>Inzamam ul Haque</dc:creator>
      <pubDate>Wed, 17 Jun 2020 19:51:09 +0000</pubDate>
      <link>https://forem.com/iaminziee/hbase-no-protocol-version-header-error-12fl</link>
      <guid>https://forem.com/iaminziee/hbase-no-protocol-version-header-error-12fl</guid>
      <description>&lt;p&gt;Hi there,&lt;/p&gt;

&lt;p&gt;Earlier today, I faced a problem: I had a task that needed the Python &lt;a href="https://happybase.readthedocs.io/en/latest/index.html"&gt;happybase&lt;/a&gt; library to perform operations on HBase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--52XNhZFp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://inzamam.dev/media/no_protocol_ver_header.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--52XNhZFp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://inzamam.dev/media/no_protocol_ver_header.png" alt='Error mesage screenshot: "No protocol version header error"'&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I received: “No protocol version header error”&lt;/p&gt;

&lt;p&gt;Finally, I resolved the problem after digging through the Cloudera docs.&lt;/p&gt;

&lt;p&gt;One thing to note before going further: the HBase Thrift server was already running when the issue occurred.&lt;/p&gt;

&lt;p&gt;I searched for this HBase config setting in Cloudera Manager:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;hbase.regionserver.thrift.http&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It was checked. In /etc/hbase/conf/hbase-site.xml, its value was &lt;em&gt;true&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I found a link: &lt;a href="https://github.com/wbolster/happybase/issues/161"&gt;&lt;em&gt;https://github.com/wbolster/happybase/issues/161&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a06u35kF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://inzamam.dev/media/thrift_hbase__ss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a06u35kF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://inzamam.dev/media/thrift_hbase__ss.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, I unchecked it (in hbase-site.xml, the value becomes &lt;em&gt;false&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Then I restarted the HBase service, and the problem was solved.&lt;/p&gt;

&lt;p&gt;I saw that a lot of people have had this issue, so I’m sharing the fix here in the hope that it helps.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Using Docker to spin-up multiple neo4j server instances on same machine</title>
      <dc:creator>Inzamam ul Haque</dc:creator>
      <pubDate>Fri, 08 May 2020 19:30:26 +0000</pubDate>
      <link>https://forem.com/iaminziee/using-docker-to-spin-up-multiple-neo4j-server-instances-on-same-machine-3nn6</link>
      <guid>https://forem.com/iaminziee/using-docker-to-spin-up-multiple-neo4j-server-instances-on-same-machine-3nn6</guid>
      <description>&lt;p&gt;All those working on neo4j might be aware of the fact that on instance of neo4j server can mount only one database at a time. Many a times we come across situations where we have to setup more than one neo4j database, and for that purpose we have to bring up separate neo4j server instance for each database we need, which involves doing a hell lot of manual steps to download installation tarballs, pushing it to different locations on machine, changing ports in configs files, etc. Here, in situation like this docker comes quit handy.&lt;/p&gt;

&lt;p&gt;The key to running more than one neo4j server simultaneously is to use different ports for the http, https and bolt connections, which is quite easy to do with the docker image. For my purposes, I also configured neo4j so that it can access the database from a non-default location.&lt;/p&gt;

&lt;p&gt;As Docker is involved here, I’m assuming you already have it installed on your Linux machine. So, let’s get our hands dirty.&lt;/p&gt;

&lt;p&gt;We’ll follow these steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Pulling the neo4j docker image&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker pull neo4j&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Running neo4j docker image&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By default, the neo4j docker image mounts the following folders:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;home: /var/lib/neo4j&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;config: /var/lib/neo4j/conf&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;logs: /var/lib/neo4j/logs&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;plugins: /var/lib/neo4j/plugins&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import: /import&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data: /var/lib/neo4j/data&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;certificates: /var/lib/neo4j/certificates&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;run: /var/lib/neo4j/run&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;These directories may correspond to already existing directories on the system. In my case, I already had a neo4j community server running on my machine, so all of these locations existed and were being used by that server. Therefore, I had to provide custom locations holding the same information. These locations will not exist on your machine if you have not installed the server version and only intend to use the docker image. To my knowledge, the most important of the above-mentioned folders are &lt;code&gt;data&lt;/code&gt; (where your actual database is created and stored), &lt;code&gt;import&lt;/code&gt; (where you put, for example, CSV files for import) and &lt;code&gt;conf&lt;/code&gt; (where you put the neo4j.conf file).&lt;/p&gt;
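&lt;p&gt;If you want to pre-create those host-side folders before the first run, a minimal sketch (the base path below is illustrative; adjust it to taste):&lt;/p&gt;

```shell
# Create the host directories that will later be mounted into
# the container as /data, /import and /conf.
BASE=$HOME/neo4j-instance-1
mkdir -p $BASE/data $BASE/import $BASE/conf
ls $BASE
```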

&lt;p&gt;Now, we’re going to run the neo4j docker image taking care of running the server on non-default ports and also creating or mounting the required folders from the desired location.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run --detach --name=neo4j-instance-1 --rm \&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--publish=7475:7474 --publish=7476:7473 --publish=7688:7687 \&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--volume=$HOME/neo4j-instance-1/data:/data \&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--volume=$HOME/neo4j-instance-1/import:/import \&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--volume=$HOME/neo4j-instance-1/conf:/conf neo4j&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Let’s break-down the above command.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;docker run … neo4j&lt;/code&gt; runs the neo4j docker image.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--detach&lt;/code&gt; runs the container in the background and returns the prompt.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--name=neo4j-instance-1&lt;/code&gt; gives the desired name to the docker instance; otherwise docker chooses a random name, which might not be easy to remember if we want to refer to this session in the future.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--rm&lt;/code&gt; deletes the docker instance from the list upon session termination. This is useful if we want to reuse the same name.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--publish=7475:7474 --publish=7476:7473 --publish=7688:7687&lt;/code&gt; publishes/forwards the default http, https and bolt ports to the desired ports. In this case, the http, https and bolt ports will be forwarded to 7475, 7476 and 7688 respectively.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--volume=$HOME/neo4j-instance-1/data:/data \&lt;/code&gt;
&lt;code&gt;--volume=$HOME/neo4j-instance-1/import:/import \&lt;/code&gt;
&lt;code&gt;--volume=$HOME/neo4j-instance-1/conf:/conf&lt;/code&gt; mounts the desired locations for database creation or access.&lt;/li&gt;
&lt;/ol&gt;
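&lt;p&gt;The same recipe gives you a second instance: only the container name, the host-side ports and the volume paths need to change. A sketch that just prints the command for review rather than running it (the instance name and shifted ports are illustrative):&lt;/p&gt;

```shell
# Build and print the docker run command for a second instance
# on shifted ports; copy-paste the output to actually launch it.
INSTANCE=neo4j-instance-2
echo docker run --detach --name=$INSTANCE --rm \
  --publish=7477:7474 --publish=7478:7473 --publish=7689:7687 \
  --volume=$HOME/$INSTANCE/data:/data \
  --volume=$HOME/$INSTANCE/import:/import \
  --volume=$HOME/$INSTANCE/conf:/conf neo4j
```

&lt;p&gt;Remove the &lt;code&gt;echo&lt;/code&gt; to run the container directly.&lt;/p&gt;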

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: If you are running this command for the first time, it will create the folders mentioned in the &lt;code&gt;--volume&lt;/code&gt; flags. Otherwise, it will mount the existing folders onto the neo4j docker defaults.&lt;/p&gt;

&lt;p&gt;If no error is returned then your neo4j server is running and should have been mapped to the desired ports and folders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Check the docker and neo4j server running status.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To check the currently running docker sessions, run &lt;code&gt;docker ps&lt;/code&gt;, which should give you output something like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;1afa157d9caa neo4j "/sbin... 7473/tcp,... 87/tcp neo4j-instance-1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To terminate this session: &lt;code&gt;docker kill neo4j-instance-1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can also check the status by using: &lt;code&gt;netstat -tunlp | grep 7475&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To check neo4j running status: open a web browser and then navigate to&lt;/p&gt;

&lt;p&gt;http://localhost:7475 (or your server’s address, with the port that you used for forwarding in step 2)&lt;/p&gt;

&lt;p&gt;It should display a page like the one below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YEn2bpZR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lumen.netlify.com/media/neo4j1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YEn2bpZR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lumen.netlify.com/media/neo4j1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Change the bolt port to 7688 (or the port that you used for forwarding in step 2), use neo4j/neo4j as the username/password, and click Connect. Set a new password on the next screen.&lt;/p&gt;

&lt;p&gt;It should connect you to your default graph.db database which should look something like below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ikZLDfVN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lumen.netlify.com/media/neo4j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ikZLDfVN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lumen.netlify.com/media/neo4j2.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>neo4j</category>
      <category>docker</category>
      <category>graphdb</category>
      <category>data</category>
    </item>
    <item>
      <title>Optimization essentials for your Neo4j Cypher queries</title>
      <dc:creator>Inzamam ul Haque</dc:creator>
      <pubDate>Thu, 16 Apr 2020 07:02:50 +0000</pubDate>
      <link>https://forem.com/iaminziee/optimization-essentials-for-your-neo4j-cypher-queries-hgh</link>
      <guid>https://forem.com/iaminziee/optimization-essentials-for-your-neo4j-cypher-queries-hgh</guid>
      <description>&lt;p&gt;I have been working on Neo4j for quite some time. It has been a good learning experience with it and has allowed exploring and extracting knowledge from connected data. In case you’re not familiar with it, Neo4j is currently the leading vendor in the space of graph databases.&lt;/p&gt;

&lt;p&gt;Around a month ago, while helping on a friend’s project, I had to spend a few hours optimizing about 10 Cypher queries that were not performing well enough (query times ranged from 46786ms to 135759ms) on a QA server. After some trial and error, I had changed them all and brought the query run-times down to between 2367ms and 5755ms. That’s when I understood why it is so important to focus on query optimization from the very beginning of the development cycle. So I’m writing this up in case it helps someone.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;First&lt;/strong&gt; thing towards query optimization is to check the execution plan of your query. Neo4j provides two keywords for it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;EXPLAIN&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROFILE&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Either of them can be prefixed to your query to check the execution plan. The only difference is that &lt;code&gt;EXPLAIN&lt;/code&gt; provides estimates of the graph engine processing that will occur but does not execute the Cypher statement, while &lt;code&gt;PROFILE&lt;/code&gt; executes the Cypher statement and provides real profiling information for what occurred in the graph engine during the query.&lt;/p&gt;

&lt;p&gt;We can use it like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;PROFILE&lt;/code&gt;&lt;/strong&gt; &lt;code&gt;MATCH&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;(celebrity:Person)&amp;lt;-[:FOLLOWS*0..4]-(follower:Person)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RETURN celebrity, follower LIMIT 500&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mKTlFMcV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lumen.netlify.com/media/profile_neo4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mKTlFMcV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lumen.netlify.com/media/profile_neo4j.png" alt="Neo4j Profile Query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second,&lt;/strong&gt; I started &lt;a href="https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/"&gt;indexing&lt;/a&gt; the most frequently used properties of nodes with different labels. This improved the timings by a good margin. For a node label, the index can be created on a single property or on multiple properties.&lt;/p&gt;

&lt;p&gt;For creating a single-property index,&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE INDEX [index_name] FOR (n:LabelName) ON (n.propertyName)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For creating a composite index,&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE INDEX [index_name] FOR (n:LabelName)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ON (n.propertyName_1, n.propertyName_2, ..., n.propertyName_n)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Remember that &lt;em&gt;if you have set a constraint on a property of a node, there is no need to create an index for that property&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third&lt;/strong&gt;, make sure that as many of your Cypher queries as possible use parameters. If you go through the &lt;a href="https://neo4j.com/docs/cypher-manual/current/syntax/parameters/"&gt;Neo4j documentation&lt;/a&gt;, you’ll see how parameters help with the caching of execution plans.&lt;/p&gt;
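&lt;p&gt;As a small sketch (the label and property names here are illustrative), the value is passed separately from the query text, so one cached plan can serve every value:&lt;/p&gt;

```
// The driver supplies $name at run time; Neo4j can cache
// and reuse a single execution plan for all values.
MATCH (p:Person)
WHERE p.name = $name
RETURN p
LIMIT 25
```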

&lt;p&gt;&lt;strong&gt;Fourth,&lt;/strong&gt; although it will not always be possible, avoid the &lt;code&gt;ORDER BY&lt;/code&gt; and &lt;code&gt;DISTINCT&lt;/code&gt; clauses unless they are strictly needed. They add a lot of time.&lt;/p&gt;

&lt;p&gt;There are still a few things I’ve been unable to fully simplify, including the removal of optional paths, like this pattern:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;MATCH A-[o?:optional]-B&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;WHERE (o is present, match B to C and D)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;OR (o is absent, match A to E and F)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I’ll update this section later whenever I find a fool-proof way to remove optional paths. If you have any suggestions on this, please help me out on Twitter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fifth,&lt;/strong&gt; if you are running updates, make sure each update handles only about 10k–100k records; if you have more, please batch them.&lt;/p&gt;
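&lt;p&gt;One way to batch, assuming the APOC plugin is installed (the label and property below are illustrative):&lt;/p&gt;

```
// Update nodes in batches of 10k instead of one huge transaction.
CALL apoc.periodic.iterate(
  "MATCH (p:Person) WHERE p.migrated IS NULL RETURN p",
  "SET p.migrated = true",
  {batchSize: 10000, parallel: false}
)
```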

&lt;p&gt;&lt;strong&gt;Sixth,&lt;/strong&gt; try to run tests on servers with as few resources as possible; I would recommend your local machine.&lt;/p&gt;

&lt;p&gt;One more suggestion from my side: use test/development data that is close to production data. This helps you discover the special cases in your data; for example, you might find later that some nodes are heavily connected while others are not, and that some queries perform differently for lightly and heavily connected nodes.&lt;/p&gt;

&lt;p&gt;To summarize:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the query plan with &lt;code&gt;EXPLAIN&lt;/code&gt; and &lt;code&gt;PROFILE&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Analyze not only the time taken to execute the query, but also the time to iterate through the results.&lt;/li&gt;
&lt;li&gt;Index your most-used properties.&lt;/li&gt;
&lt;li&gt;Parameterize your queries.&lt;/li&gt;
&lt;li&gt;Examine your MATCH and RETURN clauses. Include in the MATCH only those parts that are required in the RETURN. The remaining parts, which only filter the results, can go into the WHERE clause.&lt;/li&gt;
&lt;li&gt;Get rid of &lt;code&gt;ORDER BY&lt;/code&gt; and &lt;code&gt;DISTINCT&lt;/code&gt; wherever possible.&lt;/li&gt;
&lt;li&gt;Optional paths can be moved from the MATCH into the WHERE if you don’t need them in the results.&lt;/li&gt;
&lt;li&gt;Test with data as close to live data as possible.&lt;/li&gt;
&lt;li&gt;Use batches while running updates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can refer to these for more details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://neo4j.com/docs/developer-manual/current/cypher/query-tuning/"&gt;Developer Manual: Query Tuning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neo4j.com/docs/developer-manual/current/cypher/execution-plans/"&gt;Developer Manual: Execution Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@mesirii/5-tips-tricks-for-fast-batched-updates-of-graph-structures-with-neo4j-and-cypher-73c7f693c8cc"&gt;Batched Efficient Updates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/neo4j/loading-graph-data-for-an-object-graph-mapper-or-graphql-5103b1a8b66e"&gt;Efficient Loading of Sub-graphs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>neo4j</category>
      <category>cypher</category>
      <category>graphdb</category>
      <category>optimisation</category>
    </item>
    <item>
      <title>Shell loops to process text files? Think again!</title>
      <dc:creator>Inzamam ul Haque</dc:creator>
      <pubDate>Fri, 10 Apr 2020 15:12:13 +0000</pubDate>
      <link>https://forem.com/iaminziee/shell-loops-to-process-text-files-think-again-5264</link>
      <guid>https://forem.com/iaminziee/shell-loops-to-process-text-files-think-again-5264</guid>
      <description>&lt;p&gt;Holla folks! How’s lockdown going on? 😁&lt;/p&gt;

&lt;p&gt;I’m writing this as the first entry in what I hope could live on to be called my blog. A lot of procrastination, a whole lot of ifs and buts about how I should start writing, and then yesterday I finally decided to write!&lt;/p&gt;

&lt;p&gt;I am not sure whether you have used shell scripts before or write them regularly. In any case, just take this post as my opinion, not as a set of hard-and-fast rules.&lt;/p&gt;

&lt;p&gt;Lately, I have come across many scripts written to process text files that contain things like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while read row; do

   echo $row | cut -c3

done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for row in `cat file`; do

   foo=`echo $row | awk '{print $2}'`

   echo blahblah $foo

done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Those are naive, literal translations of what you would do in languages like C or Python, but that’s not how you do things in shells. Those examples are very inefficient and completely unreliable, and if you ever manage to fix most of the bugs, your code becomes illegible (I will explain this later).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conceptually,&lt;/strong&gt; in these languages, building blocks are just one level above machine instructions. You tell your processor what to do and then what to do next. You open that file, you read that many bytes, you do this, you do that with it.&lt;/p&gt;

&lt;p&gt;Shells are a higher-level language. I would say it’s not even a language: they’re all just command-line interpreters. The job is done by the commands you run, and the shell is only meant to orchestrate them.&lt;/p&gt;

&lt;p&gt;I think of the shell more as a plumbing tool. You open the files, set up the pipes, invoke the commands, and when it’s all ready, everything just flows without the shell doing anything. The tools do their job concurrently, efficiently, at their own pace, with enough buffering so that no one of them blocks another. It’s beautiful and yet so simple.&lt;/p&gt;

&lt;p&gt;Take &lt;code&gt;cut&lt;/code&gt; as an example: using it once is like opening the kitchen drawer, taking out the knife, using it, washing it, drying it, and putting it back in the drawer.&lt;/p&gt;

&lt;p&gt;When you do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while read row; do

   echo $row | cut -c3

done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It’s like, for each line of the file: get the &lt;code&gt;read&lt;/code&gt; tool from the drawer, read a line, wash your read tool, and put it back in the drawer. Then schedule a meeting for the &lt;code&gt;echo&lt;/code&gt; and &lt;code&gt;cut&lt;/code&gt; tools, get them from the drawer, invoke them, wash them, dry them, and place them back in the drawer. This process keeps repeating until the last line.&lt;/p&gt;

&lt;p&gt;You can read the above paragraph as: slicing an onion, but washing your knife and putting it back in the drawer between each slice. The obviously better way is to get your cutting tool from the drawer, slice the whole onion, and put the tool back in the drawer once the whole job is done.&lt;/p&gt;

&lt;p&gt;In other words, in shells, especially for processing text, you should invoke as few utilities as possible and have them cooperate on the task, not run thousands of tools in sequence, waiting for each one to start, run and clean up before running the next one.&lt;/p&gt;

&lt;p&gt;Talking in terms of &lt;strong&gt;performance&lt;/strong&gt;: when you do an &lt;code&gt;fgets()&lt;/code&gt; or &lt;code&gt;fputs()&lt;/code&gt; in C, that’s a function in stdio. stdio keeps internal buffers for input and output for all the stdio functions, to avoid making expensive system calls too often. But the corresponding shell utilities, even the built-in ones (&lt;code&gt;read&lt;/code&gt;, &lt;code&gt;echo&lt;/code&gt;, &lt;code&gt;printf&lt;/code&gt;), can’t do that.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;read&lt;/code&gt; is meant to read one line. If it reads past the newline character, that means the next command you run will miss it. So read has to read the input one byte at a time. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;echo&lt;/code&gt; and &lt;code&gt;printf&lt;/code&gt; can’t buffer their output; they have to write it straight away, because the next command you run will not share that buffer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Especially when we get to processing big files, which could have thousands or millions of lines, it is not fine for the shell script to take a significant fraction of a second (even if it’s only a few dozen milliseconds) per line, as that can add up to hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the alternative?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of using a loop to look at each line, we pass the whole file through a pipeline of commands. This means that, instead of calling the commands thousands or millions of times, the shell calls them only once. It’s true that those commands have loops to process the file line by line, but they are not shell scripts, and they are designed to be fast and efficient.&lt;/p&gt;
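&lt;p&gt;As a toy illustration (with &lt;code&gt;printf&lt;/code&gt; standing in for a real file), the whole while-read loop from earlier collapses to a single &lt;code&gt;cut&lt;/code&gt; invocation that handles the entire stream:&lt;/p&gt;

```shell
# One cut process handles every line, instead of an
# echo+cut pair being spawned per line.
printf 'alpha\nbravo\ncharlie\n' | cut -c3
# prints the 3rd character of each line: p, a, a
```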

&lt;p&gt;Unix has many wonderful built-in tools, ranging from the simple to the complex, that we can use to build our pipelines. Simple tools include &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;tail&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;, &lt;code&gt;cut&lt;/code&gt;, &lt;code&gt;tr&lt;/code&gt;, &lt;code&gt;sed&lt;/code&gt;, &lt;code&gt;join&lt;/code&gt; (when merging two files), and &lt;code&gt;awk&lt;/code&gt; one-liners, among many others. When it gets more complex and you really have to apply some logic to each line, &lt;code&gt;awk&lt;/code&gt; is a good option.&lt;/p&gt;

&lt;p&gt;I’m giving you a simple example from a script I’ve written:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cat file.txt | grep -w "|" | grep -Ewo "[0-9]" | sed '4q;d' | awk '{$1=$1; print}'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I’m pushing the contents of file.txt into the pipe using &lt;code&gt;cat file.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;then selecting only lines containing ’|’ using &lt;code&gt;grep -w "|"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;selecting lines containing only digits with &lt;code&gt;grep -Ewo "[0-9]"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;then selecting just the 4th line of that stream using &lt;code&gt;sed '4q;d'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;and lastly, trimming white-space with &lt;code&gt;awk '{$1=$1; print}'&lt;/code&gt; (note the &lt;code&gt;print&lt;/code&gt;; without it, awk outputs nothing)
&lt;/li&gt;
&lt;/ul&gt;
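&lt;p&gt;And when the per-line logic grows beyond a one-step transformation, a single &lt;code&gt;awk&lt;/code&gt; process can carry state across lines. A small sketch (the sample data is made up) that sums the second column of its input:&lt;/p&gt;

```shell
# One awk process does all the looping and accumulation;
# the shell only sets up the pipe.
printf 'a 1\nb 2\nc 3\n' | awk '{ total += $2 } END { print total }'
# prints: 6
```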

&lt;p&gt;But, my suggestion as a whole would be to avoid using shell for doing what it’s not very good at and what it is not intended for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finally&lt;/strong&gt;, I’ll close on the note that it’s all your call to decide what to use, based on the situation and your requirements. And with the good old proverb of “&lt;em&gt;horses for courses&lt;/em&gt;”, I would suggest that next time you take a pause before thinking: &lt;strong&gt;‘I’m going to do this in shell’!&lt;/strong&gt; 😄&lt;/p&gt;

&lt;p&gt;In case you want to dig into some deeper details, I’m leaving you with a few good articles/answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line/209184#209184"&gt;Stéphane Chazelas answer on stackoverflow on: Understanding “IFS= read -r line”&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.unix.com/homework-and-coursework-questions/261027-alternative-solution-nested-loops-shell-programming.html"&gt;Alternative solution to nested loops in shell programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.theunixschool.com/2012/06/10-tips-to-improve-performance-of-shell.html"&gt;10 tips to improve Performance of Shell Scripts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tada!&lt;/p&gt;

</description>
      <category>linux</category>
      <category>shell</category>
      <category>bash</category>
      <category>scripting</category>
    </item>
  </channel>
</rss>
