Building Hadoop clusters review

If you are interested in Hadoop, this video course is probably worth evaluating. As you probably know, Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. All the Hadoop modules are designed on the assumption that hardware failures are common and should therefore be handled automatically in software by the framework.

Talking about the video course, we can divide the content into three main sections:
1. how to create and set up a three-machine cluster using Amazon EC2,
2. how to install a Hadoop cluster using Apache Ambari,
3. how to start using the Hadoop cluster, in particular with Hue (Hadoop User Experience).

The description of all the topics is clear and well done (Sean Mikha, the author, did a good job). Every relevant topic is first explained in terms of its logical structure and approach, and only then demonstrated in practice.

The creation of the virtual machines on Amazon EC2 is also useful for other purposes. The practical, step-by-step description is not limited to creating the servers but also covers security and connectivity using, for example, the PuTTY SSH client.

In my opinion, the most valuable part of this video course lies in the hidden details of the Hadoop cluster installation process. As you will see if you decide to follow it, the tasks are quite easy to do (probably thanks to Sean), but the configuration details and settings are very important if you want to make it work in practice. By following the hints, I’m sure every neophyte will save days of work and a lot of nights spent googling. ;-)

Enjoy your Hadoop Cluster video course… as usual by Packt Publishing.

Francesco Corti

Solr doesn’t return more than 1,000 objects in Alfresco.

Once upon a time, Alfresco used Apache Lucene as its search engine…

This was great until you had particular needs such as a long-running query or a query that retrieves a huge number of objects. More than a year ago I wrote a post about how Alfresco returns a maximum of 1,000 results, or runs a query for at most a couple of minutes.

As you can read in that post, the most commonly suggested solution to the problem was to migrate the indexing engine to Apache Solr. At that time, Alfresco supported both engines and considered Solr its future.

Today Lucene and Solr are both still supported, and Solr is probably the more widely used, but the same issue seems to be coming back.

>> https://issues.alfresco.com/jira/browse/ALF-20567(*) <<

As you can read in the JIRA issue, in Alfresco 4.2.e Solr also returns a maximum of 1,000 results; the suggested fix is to set the parameters below in the alfresco-global.properties file.

solr.query.maximumResultsFromUnlimitedQuery=10000
system.acl.maxPermissionChecks=10000
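
If you prefer to apply the two settings from the command line, a minimal sketch could look like the one below. The helper name set_property is hypothetical (it is not part of Alfresco), and the path in the usage comment is relative to your Alfresco installation folder. The properties file is read at startup, so Alfresco has to be restarted afterwards.

```shell
#!/bin/sh
# Hypothetical helper (not part of Alfresco): set KEY=VALUE in a properties file.
# An existing "KEY=..." line is replaced; otherwise the line is appended.
# Assumes VALUE contains no "|" or "&" and that the file ends with a newline.
set_property() {
    file=$1
    key=$2
    value=$3
    # Escape the dots in the key so they are matched literally.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^${pattern}=" "$file"; then
        # The previous version of the file is kept as FILE.bak.
        sed -i.bak "s|^${pattern}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage, from the Alfresco installation folder:
# set_property tomcat/shared/classes/alfresco-global.properties solr.query.maximumResultsFromUnlimitedQuery 10000
# set_property tomcat/shared/classes/alfresco-global.properties system.acl.maxPermissionChecks 10000
```

Running the helper twice with the same arguments leaves a single line per key, so it is safe to repeat.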

This could have a significant impact on “big” or “long” queries, so I would like to share this information with all of you to prevent problems or nights spent in the debugger. ;-)

I hope this will help you.

Francesco Corti

(*) Thanks to Francesco Fornasari and Christian Tiralosi for the hint.

Yet another Alfresco Community upgrade tutorial: from 4.0.d to 4.2.f.

Upgrading Alfresco (Community or Enterprise) from one version to a more recent one has to follow a clear and precise path. It is always a critical task, and in some cases it can be a serious problem for organizations (of course, this is more critical for Community Editions). In some cases the only possible solution is an Alfresco-to-Alfresco migration instead of an upgrade… but that is another scenario.

This tutorial describes a step-by-step approach to upgrading Alfresco Community Edition from v4.0.d to v4.2.f in a single upgrade step. Even if the versions involved are different, the approach is always the same as the one discussed here.

Needless to say: I am not responsible for any damage that may occur after following these instructions, though hopefully none will.

The (only) correct approach

Before starting, I would like to share the (only) correct approach: please remember that the upgrade process for Alfresco Community Editions is tested (but not guaranteed) only between the closest versions (for Alfresco Enterprise you can take a look here). This means that the only safe path from a very old version to a recent one is to perform multiple consecutive upgrades.

For example, if you come from v4.0.d and want to go to the recent v5.0.a, it is written only in the stars whether the direct upgrade will work. The most verified approach is to perform the upgrade through the steps described below:
– Upgrade from v4.0.d to v4.0.e,
– Upgrade from v4.0.e to v4.2.a,
– Upgrade from v4.2.a to v4.2.b,
– Upgrade from v4.2.b to v4.2.c,
– Upgrade from v4.2.c to v4.2.d,
– Upgrade from v4.2.d to v4.2.e,
– Upgrade from v4.2.e to v4.2.f,
– Upgrade from v4.2.f to v5.0.a.
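
The chain above is simply every pair of consecutive releases. As a small illustration, the sketch below prints the steps for any ordered list of versions; the function name print_upgrade_steps is hypothetical, and the script only prints text, it does not upgrade anything.

```shell
#!/bin/sh
# Hypothetical helper: print one upgrade step per pair of consecutive versions.
print_upgrade_steps() {
    prev=""
    # $1 is a space-separated, ordered list of versions.
    for version in $1; do
        if [ -n "$prev" ]; then
            echo "Upgrade from v$prev to v$version"
        fi
        prev=$version
    done
}

# Usage:
# print_upgrade_steps "4.0.d 4.0.e 4.2.a 4.2.b 4.2.c 4.2.d 4.2.e 4.2.f 5.0.a"
```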

You can take your own risks by “jumping” some steps, and in some cases it will work, but nothing is guaranteed. In this tutorial I decided to take a reasonable risk, often discussed in forums and tutorials, and “jump” with a single upgrade process.

Preparing the upgrade

To perform the upgrade I need the Alfresco backup of my v4.0.d production installation. If you don’t know what an Alfresco backup is or how to obtain one, I strongly recommend taking a look here.
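
For orientation only, a cold backup can be sketched roughly as below. Everything here is an assumption to adapt to your installation: the function names, the three folder arguments, the "tomcat" argument of alfresco.sh, and a database and user both called alfresco. With DRY_RUN=1 the commands are only printed, which is a safe way to review them first.

```shell
#!/bin/sh
# Hypothetical cold-backup sketch, not an official Alfresco procedure.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# cold_backup ALF_HOME BACKUP_DIR PG_BIN
cold_backup() {
    alf_home=$1
    backup_dir=$2
    pg_bin=$3
    # Stop only Tomcat so that nothing writes to the repository,
    # while the database keeps running for the dump (assumed to be supported).
    run "$alf_home/alfresco.sh" stop tomcat
    run mkdir -p "$backup_dir"
    # Custom-format dump (-Fc), which is the format pg_restore reads.
    run "$pg_bin/pg_dump" -Fc -h localhost -U alfresco -f "$backup_dir/alfresco.dump" alfresco
    # Copy the content folders, preserving permissions and timestamps.
    run cp -Rp "$alf_home/alf_data/contentstore" "$backup_dir/"
    run cp -Rp "$alf_home/alf_data/contentstore.deleted" "$backup_dir/"
    run "$alf_home/alfresco.sh" start tomcat
}

# Usage (paths are placeholders; this only prints the commands):
# DRY_RUN=1 cold_backup /opt/alfresco /backup/alfresco /opt/alfresco/postgresql/bin
```

The database dump and the content folders are taken while Tomcat is stopped so that the two stay consistent with each other.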

In this tutorial I chose to set up a brand new server with the recent Alfresco installation (in our case v4.2.f), but you could choose to use the same server. Of course, in that case the task is even more critical; the steps are the same but performed in folders different from those of the “old” version of Alfresco.

The new Alfresco installation

As introduced before, in this tutorial I work on a vanilla server running Ubuntu 14.04 LTS. Oracle Java 7 update 60 (1.7.0_60) is installed on the server, as described here.

To install Alfresco you can follow this tutorial, even though it describes one specific version (the installation steps don’t change much). Alternatively, you can install it using the easier wizard. In either case you will install the target version of Alfresco, in our case Alfresco Community v4.2.f.

For the purpose of this post, the way you install Alfresco is not relevant, but remember that this will be your brand new server, so it’s always advisable to choose the most robust and stable option. ;-)

If you have some customizations (custom models, behaviors, actions or something else), now it’s time to install them on the new server. The task is always the same: stop Alfresco, deploy the customizations the way you always do (AMP, Maven, manually) and start Alfresco again.

As a final step, it is always advisable to switch off indexing. In our case we assume that Solr is used, but with Lucene it is the same. To perform the task, follow the steps below:

cd <alfresco>
./alfresco.sh stop
nano tomcat/shared/classes/alfresco-global.properties

 ...
 index.subsystem.name=noindex
 #solr.port.ssl=8443 (comment it)
 ...

Save and exit.
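
If you would rather not edit the file by hand, the same change can be scripted. The sketch below is only an illustration: the function name disable_indexing is hypothetical, and it assumes that the properties are written at the start of a line in the usual key=value form.

```shell
#!/bin/sh
# Hypothetical helper: switch indexing off in a properties file.
# The untouched original is kept next to it as FILE.bak.
disable_indexing() {
    file=$1
    cp -p "$file" "$file.bak"
    # Comment out an active solr.port.ssl line and force the noindex subsystem.
    sed -e 's/^solr\.port\.ssl=/#&/' \
        -e 's/^index\.subsystem\.name=.*/index.subsystem.name=noindex/' \
        "$file.bak" > "$file"
    # Append the property if the file did not define it at all.
    grep -q '^index\.subsystem\.name=' "$file" || echo 'index.subsystem.name=noindex' >> "$file"
}

# Usage, from the Alfresco installation folder and with Alfresco stopped:
# disable_indexing tomcat/shared/classes/alfresco-global.properties
```

The backup copy makes it easy to compare the result with diff before starting Alfresco again.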

Database restore

Now it’s time to restore the Alfresco database from the backup. To do so, make sure that PostgreSQL (or the database you use) is running. If you installed Alfresco with the wizard, you can use the command below.

./alfresco.sh start postgresql

To delete the current Alfresco database, use the commands below.

cd <postgresql>/bin
./psql -h localhost -U postgres -d postgres

  ...
  DROP DATABASE alfresco;
  CREATE DATABASE alfresco WITH owner = alfresco;
  \q
  ...

To restore the database dump you can use:

./pg_restore -h localhost -U postgres -d alfresco <file.dump>

Filesystem restore

Once the database is restored you have to restore the documents on the file system from the backup.

cd <alfresco>/alf_data
rm -rf contentstore
rm -rf contentstore.deleted

Now it’s time to copy the ‘contentstore’ and ‘contentstore.deleted’ folders from the backup directly into ‘alf_data’.
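
The copy itself can be sketched as below. The function name restore_contentstore and its two folder arguments are hypothetical; the sketch assumes the backup folder contains the two folders at its top level and that you run it as the user that owns the Alfresco installation.

```shell
#!/bin/sh
# Hypothetical helper: copy the content folders from a backup into alf_data.
# restore_contentstore BACKUP_DIR ALF_DATA
restore_contentstore() {
    backup_dir=$1
    alf_data=$2
    for dir in contentstore contentstore.deleted; do
        if [ -d "$backup_dir/$dir" ]; then
            # -R copies recursively, -p preserves permissions and timestamps.
            cp -Rp "$backup_dir/$dir" "$alf_data/"
        else
            echo "warning: $backup_dir/$dir not found in the backup" >&2
        fi
    done
}

# Usage (paths are placeholders):
# restore_contentstore /backup/alfresco /opt/alfresco/alf_data
```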

Did you notice that the indexes are not restored? If possible, it’s always preferable to rebuild the indexes from scratch. Otherwise, we suggest restoring them from the backup, hoping nothing has changed in their structure. :-)

Alfresco bootstrap

Now everything is ready to start Alfresco again.

cd <alfresco>
./alfresco.sh start
tail -f tomcat/logs/catalina.out

You will see that the startup process updates the database and everything else needed to upgrade the system. Errors or problems will be listed here…
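
Since the log scrolls quickly during the upgrade, it can help to filter it afterwards. The sketch below is only an illustration: the function name show_startup_problems is hypothetical, and the pattern assumes a log4j-style layout in which the level appears as a separate word (for example " ERROR ").

```shell
#!/bin/sh
# Hypothetical helper: list the lines of a log file that look like problems.
# Prints "line number:line" for each match and returns non-zero if none is found.
show_startup_problems() {
    logfile=$1
    grep -nE ' (ERROR|FATAL) |Exception' "$logfile"
}

# Usage, from the Alfresco installation folder:
# show_startup_problems tomcat/logs/catalina.out
```

An empty result does not prove that the upgrade succeeded, so it is still worth reading the end of the log.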

Indexes rebuild

As you read before, the Alfresco upgrade was performed without the indexes.
Now it’s time to rebuild them, following what you read here.

./alfresco.sh stop
nano <alfresco>/tomcat/shared/classes/alfresco-global.properties

  ...
  index.subsystem.name=solr
  solr.port.ssl=8443
  ...

cd <alfresco>/alf_data/solr
rm -rf workspace/SpacesStore/*
rm -rf archive/SpacesStore/*
rm -rf workspace-SpacesStore/alfrescoModels/*
rm -rf archive-SpacesStore/alfrescoModels/*
cd <alfresco>
./alfresco.sh start

Enjoy your brand new Alfresco installation…

Francesco Corti

Alfresco roadmap for the next 12 months

After requests from some users, the new Alfresco roadmap has been released on the official wiki. This roadmap doesn’t seem to be like those of the past.

I noticed that there are fewer topics than in the past. However, each topic seems to be more detailed and “complete” (in the past, most of the items were less specific). Compared with past roadmaps, I can read a lot of “Enterprise only” labels on some important new features.

Form your own opinion by reading the complete roadmap below.

https://wiki.alfresco.com/wiki/Product_Roadmap

Francesco Corti