Tuesday, September 12, 2017

The Real Data Processing with Apache Beam and Tika

If we talk about the data injestion in the big data streaming pipelines it is fair to say that in the vast majority of cases it is the files in the CSV and other text, easy to parse formats which provide the source data.

Things will become more complex when the task is to read and parse the files in the format such as PDF. One would need to create a reader/receiver capable of parsing the PDF files and feeding the content fragments (the regular text, the text found in the embedded attachments and the file metadata) into the processing pipelines. That was tricky to do right but you did it just fine.

The next morning you get a call from your team lead letting you know the customer actually needs the content injested not only from the PDF files but also from the files in a format you've never heard of before. You spend the rest of the week looking for a library which can parse such files and when you finish writing the code involving that library's not well documented API all you think of is that the weekends have arrived just in time.

On Monday your new task is to ensure that the pipelines have to be initialized from the same network folder where the files in PDF and other format will be dropped. You end up writing a frontend reader code which reads the file, checks the extension, and then chooses a more specific reader.   

Next day, when you are told that Microsoft Excel and Word documents which may or may not be zipped will have to be parsed as well, you report back asking for the holidays...

I'm sure you already know I've been preparing you for a couple of good news.

The first one is a well known fact that Apache Tika allows to write a generic code which can collect the data from the massive number of text, binary, image and video formats. One has to prepare or update the dependencies and configuration and have the same code serving the data from the variety of the data formats.

The other and main news is that Apache Beam 2.2.0-SNAPSHOT now ships a new TikaIO module (thanks to my colleague JB for reviewing and merging the PR). With Apache Beam capable of running the pipelines on top of Spark, Flink and other runners and Apache Tika taking care of various file formats, you get the most flexible data streaming system.

Do give it a try, help to improve TikaIO with new PRs, and if you are really serious about supporting a variety of the data formats in the pipelines, start planning on integrating it into your products :-)

Enjoy!



Wednesday, September 6, 2017

Mostly On Topic: CXF and Swagger Integration Keeps Getting Better

While thinking about a title of this post I thought the current title line, with the " Keeps Getting Better" finishing touch may work well; I knew I used a similar line before, and after looking through my posts I found it.

Oh dear. I'm transported back to 2008, I can see myself, 9 years younger, walking to the Iona Technologies office, completely wired on trying to stop the Jersey JAX-RS domination :-), spotting an ad of the latest  Christina Aguilera's albom on the exit from the Lansdowne Dart station and thinking, it would be fun, trying to blog about it and link to CXF, welcome to the start of the [OT] series. I'm not sure now if I'm more surprised it was actually me who did write that post or that 9 years later I'm still here, talking about CXF :-).

Let me get back to the actual subject of this post. You know CXF started quite late with embracing Swagger, and I'm still getting nervous whenever I remind myself Swagger does not support 'matrix' parameters :-). But the Swagger team has done a massive effort through the years, my CXF hat is off to them.

I'm happy to say that now Apache CXF offers one of the best Swagger2 integrations around, at the JSON only and UI levels and it just keeps getting better.

We've talked recently with Dennis Kieselhorst and one can now configure Swagger2Feature with the external properties file which can be especially handy when this feature is auto-discovered.

Just at the last minute we resolved an issue reported by a CXF user to do with accessing Swagger UI from the reverse proxies.

Finally, Freeman contributed a java2swagger Maven plugin.

Swagger 3 will be supported as soon as possible too.

Enjoy!