Home » #recent


Interesting bits and pieces I stumble upon strolling through CyberSpace. The latest 20 posts are shown – follow the links to the Archives sections for a full list of posts on this site.

You can also have a peek at trends in search interest (via Google) on specific topics.

Reproduced and/or syndicated content. All content and images are copyright of their respective owners.

EU funds project to boost European cloud computing market - ComputerWeekly



A European Union-funded project called Cloudcatalyst has been set up to assess the current cloud computing market in Europe, identify barriers to cloud adoption and provide tools to boost its growth in the region.

The project aims to instill confidence in European businesses, public entities, ICT providers and other cloud stakeholders eager to develop and use cloud services.

Read the original article /reproduced from ComputerWeekly

Need expertise with cloud / internet-scale computing / MapReduce / Hadoop etc.? Contact me – I can help! This is a core expertise area.

It will create “a strong and enthusiastic community of cloud adopters and supporters in Europe”, according to Cordis, the European Commission’s project funding arm.

According to the EC, cloud computing is a “revolution” but its providers are still struggling to captivate and build trust among businesses and everyday citizens. “Cloud-sceptics” are concerned over data security and legal exposure and a lack of information around cloud is hindering its adoption.

The Cloudcatalyst project will tackle this issue by providing practical tools to foster the adoption of cloud computing in Europe and to boost the European cloud market.

The project, which is funded by FP7 – the 7th Framework Programme for Research and Technological Development – will target all cloud players. These include software developers, members of the scientific community developing and deploying cloud computing services, incubators at the local, national and European levels, large industries, SMEs, startups and entrepreneurs.

With a total budget of over €50bn, the project will primarily analyse cloud practices across Europe and identify the conditions for successful adoption.

“We will cover all the main issues around cloud and give a clear overview on a number of topics, such as current cloud trends, critical success factors to overcome major technical barriers, data privacy and compliance requirements, and recommendations for quality of service and cloud SLA,” said Dalibor Baskovc, vice-president at EuroCloud Europe, one of the project partners.

We see cloud as an engine of change and a central ingredient for innovation in Europe

Francisco Medeiros, European Commission

The project will also create a series of tools to help stakeholders create value-added cloud products and services. These consist of the Cloud Accelerator Toolbox and the Go-to-the-Cloud service platform – a collection of management tools bundling together trend analysis, use cases and practical recommendations in the form of printable report templates and instructional videos.

“The tools we are developing will help companies adopt and deploy cloud solutions, whatever their different needs and requirements are,” said Baskovc.

The project will also carry out a number of market surveys to gather key information and produce an overview of cloud adoption: why companies should develop cloud services, the main internal obstacles to adopting a cloud product, the associated risks, and how these issues can be addressed.

According to the European Commission, cloud computing has the potential to create millions of jobs in Europe by 2020.

“We see cloud as an engine of change and a central ingredient for innovation in Europe,” Francisco Medeiros, deputy head of unit, software and services, cloud computing at the European Commission told the Datacentres Europe 2014 audience in May this year. “Cloud is one of the fastest-growing markets in Europe.”

In 2013, worldwide hardware products grew by 4.2% to €401bn, while software and services grew by 4.5% to €877bn, signifying the importance of software services, said Medeiros.



Debunking five big HTML5 myths

HTML 5 - http://www.w3.org/


The ongoing discussion about the “readiness” of HTML5 is based on a lot of false assumptions. These lead to myths about HTML5 that get uttered once and are then continuously repeated – often without any check of their validity.

Reproduced from/read the original at Telefonica

Guest post from Christian Heilmann, Principal Developer Evangelist at Mozilla for HTML5 and open web

HTML5 doesn’t perform?

The big thing everybody wants to talk about when it comes to the problems with HTML5 is… performance. The main problem here is that almost every single comparison misses the fact that you are comparing apples and pears (no pun intended).

Comparing an HTML5 application’s performance with a native App is like comparing a tailored suit with one bought in a shop. Of course the tailored suit will fit you like a glove and look amazing, but if you ever want to sell it or hand it over to someone else you are out of luck. It just won’t be the same for the next person.

That is what native Apps are – they are built and optimized for one single environment and purpose and are fixed in their state – more on that later.

HTML5, on the other hand, is by its very definition a web technology that should run independent of environment, display or technology. It has to be as flexible as possible in order to be a success on the web.

In its very definition the web is for everybody, not just for a small group of lucky people who can afford a very expensive piece of hardware and are happy to get locked into a fixed environment governed by a single company.

Native applications need to be written for every single device and every new platform from scratch, whereas an HTML5 App allows you to support mobiles, tablets and desktops with the same product. Instead of having fixed dimensions and functionality, an HTML5 App can test what is supported and improve the experience for people on faster and newer devices whilst not locking out others who cannot buy yet another phone.

Native Apps, on the other hand, often need an upgrade and force the end user to buy new hardware, or they will not get the product at all. From a flexibility point of view, HTML5 Apps perform admirably, whilst native applications make you dependent on your hardware and leave you stranded when there is an upgrade you cannot afford or don’t want to make. A great example of this is the current switch from Apple to their own maps on iOS. Many end users are unhappy and would prefer to keep using Google Maps but cannot.

Seeing that HTML5 on the desktop is perfectly capable of excelling in performance – from scrolling performance, to analyzing and changing video on the fly, up to running full 3D games and high-speed racing games at a very high frame rate – we have to ask ourselves where the problem with its performance lies.

The answer is hardware access. HTML5 applications are treated by mobile hardware developed for iOS and Android as second-class citizens and don’t get access to the parts that allow for peak performance. A web view in iOS is prevented by the operating system from performing as fast as a native App, although it uses the same principles. On Android, both Chrome and Firefox show how fast browsers can perform, whereas the stock browser crawls along in comparison.

The stock browser on Android reminds us of the Internet Explorer of the 90s which threatened to be set in stone for a long time and hinder the world wide web from evolving – the very reason Mozilla and Firefox came into existence.

In essence HTML5 is a Formula 1 car that has to drive on a dirt road whilst dragging a lot of extra payload given to it by the operating system without a chance to work around that – for now.

HTML5 cannot be monetized?

HTML5 is a technology stack based on open web technologies. Saying that HTML5 has no monetization model is like saying the web cannot be monetized (which is especially ironic when this is written on news sites that show ads).

Whilst at first glance a closed App market is a simple way to sell your products, there is a lot of hype about their success, and in reality not many developers manage to make a living with a single app on closed App markets. As discovery and findability get increasingly harder in App markets, a lot of developers don’t build one App but hundreds of variations of the same App (talking dog, talking cat, talking donkey…), as it is all about being found quickly and being on the first page of search results in the market.

This is where closed App markets with native Apps are a real disadvantage for developers: Apps don’t have an address on the web (a URL) and cannot be found outside the market. You need to manually submit each of the Apps in each of the markets, abide by their review and submission processes, and cannot update your App easily without suffering outages in your offering.

An HTML5 App is on the web and has a URL, it can also get packaged up with products like Adobe PhoneGap to become a native application for iOS or Android. The other way around is not possible.

In the long term this raises the question of which is the better strategy for developers: betting on one closed environment that can pull your product any time it wants, or distributing over a world-wide, open distribution network and covering the closed shops as well?

Many apps in the Android and iOS store are actually HTML5 and got converted using PhoneGap. The biggest story about this was the Financial Times releasing their app as HTML5 and making a better profit than with the native one. And more recently the New York Times announced it was following suit with its Web app.

HTML5 cannot be used offline?

As HTML5 is a web technology stack, the knee-jerk reaction is to think that you have to be online all the time to use it. This is plain wrong. There are many ways to store content offline in an HTML5 application. The simplest is the Web Storage API, which is supported across all modern browsers (excluding Opera Mini, which is a special case as it sends content via a cloud service and has its own storage tools).
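As a minimal sketch of the Web Storage idea (all names here are illustrative, not from the article): values are serialized to strings and read back on demand. The storage backend is injected, so the same logic runs against `window.localStorage` in a browser or an in-memory stand-in anywhere else.

```javascript
// Offline persistence sketch over any Storage-like backend
// (in a browser you would pass window.localStorage).
function createStore(backend) {
  return {
    save(key, value) {
      // Web Storage only holds strings, so serialize to JSON.
      backend.setItem(key, JSON.stringify(value));
    },
    load(key, fallback) {
      const raw = backend.getItem(key);
      return raw === null ? fallback : JSON.parse(raw);
    },
  };
}

// In-memory stand-in with the same setItem/getItem contract:
const memoryBackend = {
  data: {},
  setItem(k, v) { this.data[k] = String(v); },
  getItem(k) { return k in this.data ? this.data[k] : null; },
};

const store = createStore(memoryBackend);
store.save('draft', { title: 'My post', words: 120 });
console.log(store.load('draft', null).words); // 120
```

Because the backend is a parameter, the offline logic stays testable outside a browser – the same pattern libraries use to paper over storage differences.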

You can also store the application itself offline using AppCache, which is supported by all major browsers except Internet Explorer. If you have more complex data to store than Web Storage provides, you can use either IndexedDB (in Chrome and Firefox) or WebSQL (in iOS and Safari). To work around these differences, libraries like Lawnchair are available to make it easy for developers to use offline storage.
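The fallback order just described can be sketched as a simple capability check (a hedged illustration, not Lawnchair's actual API; `env` stands in for the browser's global `window` object):

```javascript
// Pick the richest offline store the current environment exposes,
// mirroring the order above: IndexedDB, then WebSQL, then Web Storage.
function pickOfflineStore(env) {
  if (env.indexedDB) return 'IndexedDB';
  if (env.openDatabase) return 'WebSQL';     // WebSQL entry point
  if (env.localStorage) return 'Web Storage';
  return 'none';
}

// In a real page you would call pickOfflineStore(window).
console.log(pickOfflineStore({ indexedDB: {} }));    // "IndexedDB"
console.log(pickOfflineStore({ localStorage: {} })); // "Web Storage"
```

This is the essence of feature detection: test for what the browser actually provides instead of assuming a fixed platform.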

HTML5 has no development environment?

One concern often mentioned is that HTML5 lacks tooling for developers. Strangely enough, you never hear that argument from developers, but from people who want to buy software to make their developers more effective instead of letting them decide what makes them effective.

HTML5 development at its core is web development, and there is an amazingly practical development environment available for that. Again, the main issue is a misunderstanding of the web.

You do not build a product that looks and performs the same everywhere – this would rob the web of its core strengths. You build a product that works for everybody and excels on a target platform. Therefore your development environment is a set of tools, not a single one doing everything for you. Depending on what you build you choose to use many of them or just one.

The very success of the web as a medium is based on the fact that you do not need to be a developer to put content out – you can use a blogging platform, a CMS or even the simple text editor that comes with your operating system to start your first HTML page. As you progress in your career as a developer you find more and more tools you like and get comfortable and effective with, but there is no one tool to rule them all.

Some developers prefer IDEs like Visual Studio or Eclipse. Others want a WYSIWYG-style editor like Dreamweaver, but the largest part of web developers will use a text editor of some sort. From Sublime Text and Notepad++ up to Vim or Emacs on a Linux computer, all of these are tools that can be – and are – used by millions of developers daily to build web content.

When it comes to debugging and testing web developers are lucky these days as the piece of software our end users have to see what we build – the browser – is also the debugging and testing environment. Starting with Firefox having Firebug as an add-on to see changes live and change things on the fly, followed by Opera’s Dragonfly and Safari and Chrome’s Devtools, all browsers now also have a lot of functionality that is there especially for developers. Firefox’s new developer tools go even further and instead of simply being a debugging environment are a set of tools in themselves that developers can extend to their needs.

Remote debugging is another option we now have. It means that, as developers, we can change applications running on a phone from our development computers, instead of having to write them, send them to the phone, install them, test them, find a mistake and repeat. This speeds up development significantly.

For the more visual developers, Adobe recently released its Edge suite, which brings WYSIWYG-style development to HTML5, including drag and drop from Photoshop. Adobe’s Edge Inspect and PhoneGap make it easy to test on several devices at once and to ship HTML5 Apps as packaged native Apps for iOS and Android.

In terms of deployment and packaging Google just released their Yeoman project which makes it dead easy for web developers to package and deploy their web products as applications with all the necessary steps to make them perform well.

All in all there is no fixed development environment for HTML5 as that would neuter the platform – this is the web, you can pick and choose what suits you most.

Things HTML5 can do that native Apps can not

In essence, a lot of the myths about HTML5 are based on comparisons between something explicitly built for the platform it was tested on and something that is merely also supported on it. Comparing the performance of a speedboat and a hovercraft would yield the same predictable outcome. The more interesting question is what HTML5 offers developers and end users that native applications cannot or do not:

  • Write once, deploy anywhere – HTML5 can run in browsers, on tablets and desktops and you can convert it to native code to support iOS and Android. This is not possible the other way around.
  • Share over the web – as HTML5 apps have a URL, they can be shared over the web and found when you search the web. You don’t need to go to a marketplace and be found amongst the crowded, limited space – the same tricks used to promote other web content apply. The more people like and link to your app, the easier it will be found.
  • Built on agreed, multi-vendor standards – HTML5 is a group effort of the companies that make the web what it is now, not a single vendor that can go into a direction you are not happy with
  • Millions of developers – everybody who has built something for the web in recent years is ready to write apps. It is not a small, specialized community any longer
  • Consumption and development tool are the same thing – all you need to get started is a text editor and a browser
  • Small, atomic updates – if a native app needs an upgrade, the whole App needs to get downloaded again (new level of Angry Birds? Here are 23MB over your 3G connection). HTML5 apps can download data as needed and store it offline, thus making updates much less painful.
  • Simple functionality upgrade – native apps must ask you for access to hardware when you install them and cannot change that later, which is why every app asks for access to everything upfront (which of course is a privacy/security risk). An HTML5 app can ask for access to hardware and data on demand, without needing an update or re-installation.
  • Adaptation to the environment – an HTML5 app can use responsive design to give the best experience for the environment without having to change the code. You can switch from Desktop to mobile to tablet seamlessly without having to install a different App on each.

Let’s see native Apps do that.
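The adaptation point in the list above boils down to choosing a layout from the environment at runtime rather than shipping one fixed App per device. A minimal sketch (the breakpoints are illustrative; in a page the width would come from `window.innerWidth` or a CSS media query via `window.matchMedia`):

```javascript
// One code base, three experiences: classify the viewport width
// and let the rest of the app adapt its layout accordingly.
function layoutFor(widthPx) {
  if (widthPx < 600) return 'mobile';
  if (widthPx < 1024) return 'tablet';
  return 'desktop';
}

console.log(layoutFor(480));  // "mobile"
console.log(layoutFor(768));  // "tablet"
console.log(layoutFor(1280)); // "desktop"
```

The same product thus serves a phone, a tablet and a desktop without installing a different App on each.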

Breaking the hardware lockout and making monetization easier

The main reason why HTML5 is not the obvious choice for developers right now is the above-mentioned lockout when it comes to hardware. An iOS device does not allow different browser engines and does not allow HTML5 to access the camera, the address book, vibration, the phone or text messaging – in other words, everything that makes a mobile device interesting for developers, and very necessary functionality for Apps.

To work around this issue, Mozilla and a few others have created a set of standardized APIs for this hardware access, called WebAPIs. These allow every browser out there to access the hardware in a secure way and break the lockout.

The first environment to implement these is Firefox OS, with devices shipping next year. Using a Firefox OS phone, you can build applications that have the same access to hardware that native applications have. Developers have direct access to the hardware and thus can build much faster and – more importantly – much smaller Apps. For the end user the benefit is that the devices will be much cheaper, and Firefox OS can run on very low-specification hardware that, for example, cannot be upgraded to the newest Android.

In terms of monetization Mozilla is working on their own marketplace for HTML5 Apps which will not only allow HTML5 Apps to be submitted but also to be discovered on the web with a simple search. To make it easier for end users to buy applications we partner with mobile providers to allow for billing to the mobile contract. This allows end users without a credit card to also buy Apps and join the mobile web revolution.

How far is HTML5?

All in all, HTML5 is coming along in leaps and bounds as a very interesting and reliable platform for app developers. The main barrier we have to remove is hardware access, and with the WebAPI work and systems like PhoneGap these issues are much less of a blocker than we anticipated.

The benefits of HTML5 over native apps mentioned above should be reason enough for developers to get involved and start with HTML5 instead of spending their time building a different code base for each platform. If all you want to support is one special platform you don’t need to go that way, but then it is also pointless to blame HTML5 issues for your decision.

HTML5 development is independent of platform and browser; if you don’t embrace that idea, you limit its potential. Historically, closed platforms have come and gone while the web is still going strong. It allows you to reach millions of users world-wide and lets you start developing without asking anyone for permission or installing a complex development environment. This was and is the main reason why people start working with the web. And nobody is locked out, so have a go.


Open Standards@EU


I am a strong believer in and supporter of the adoption of and adherence to Open Standards, to the maximum extent possible (without ignoring specific context considerations, influence, extent of applicability etc.). The Digital Agenda for Europe identified “lock-in” as a problem. Building open ICT systems by making better use of standards in public procurement will help prevent lock-in.

Action 23 committed to providing guidance on the link between ICT standardisation and public procurement to help public authorities use standards to promote efficiency and reduce lock-in.

In June 2013 the Commission issued a Communication, accompanied by a Staff Working Document that contains a practical guide on how to make better use of standards in procurement, in particular in the public sector, including some of the barriers involved.

Read more at OpenStandards@EU

The change to standards-based systems

Even though the short-term costs might seem a barrier to change, in the long run the change to a standards-based system will benefit the overall public procurement scenario. It should therefore be carried out on a long-term basis (5 to 10 years), replacing systems which require a new procurement with standards-based alternatives.

This requires public authorities to list all their ICT systems and understand how they work together, within their own organisation and with their stakeholders’ systems. They should identify which of these systems cannot easily be changed to alternatives (these are the systems causing lock-in). For all of these, they should consider standards-based alternatives.

In addition, the process should be replicated for every system that is part of the same network, improving the adoption of common standards.

Best practices

Fighting lock-in requires support from public authorities at all levels. Some countries are actively promoting the use of standards and have already gained a lot of practical experience. In order to learn from their experience, the Commission organises meetings with public authorities, the ICT supply industry, standards organisations and civil society across Europe.

By sharing their experience on a regular basis, public organisations learn from each other, adapt to emerging best practices and tackle common problems and solutions. This sharing of best practice will ensure that the choices made in different Member States will converge, reducing fragmentation and helping to ensure a real digital single market.

Neural networks that function like the human visual cortex may help realize faster, more reliable pattern recognition - PHYS.org

Artificial neural networks that can more closely mimic the brain’s ability to recognize patterns potentially have broad applications in biometrics, data mining and image analysis. Credit: janulla/iStock/Thinkstock


Despite decades of research, scientists have yet to create an artificial neural network capable of rivaling the speed and accuracy of the human visual cortex. Now, Haizhou Li and Huajin Tang at the A*STAR Institute for Infocomm Research and co-workers in Singapore propose using a spiking neural network (SNN) to solve real-world pattern recognition problems. Artificial neural networks capable of such pattern recognition could have broad applications in biometrics, data mining and image analysis.

Read the full original article from / reproduced from PHYS.ORG

Humans are remarkably good at deciphering handwritten text and spotting familiar faces in a crowd. This ability stems from the visual cortex—a dedicated area at the rear of the brain that is used to recognize patterns, such as letters, numbers and facial features. This area contains a complex network of neurons that work in parallel to encode visual information, learn spatiotemporal patterns and classify objects based on prior knowledge or statistical information extracted from patterns.

Like the human visual cortex, SNNs encode visual information in the form of spikes by firing electrical pulses down their ‘neurons’. The researchers showed that an SNN employing suitable learning algorithms could recognize handwritten numbers from the Mixed National Institute of Standards and Technology (MNIST) database with a performance comparable to that of support vector machines – the current benchmark for such methods.

Their SNN has a feedforward architecture and consists of three types of neurons: encoding, learning and readout neurons. Although the learning neurons are fully capable of discriminating patterns in an unsupervised manner, the researchers sped things up by incorporating supervised learning algorithms into the computation so that the learning neurons could respond to changes faster.

… Continue reading the full article from PHYS.ORG

More information: Yu, Q., Tang, H., Tan, K.C. & Li, H. Rapid feedforward computation by temporal encoding and learning with spiking neurons. IEEE Transactions on Neural Networks and Learning Systems 24, 1539–1552 (2013). dx.doi.org/10.1109/TNNLS.2013.2245677


Introducing Project Adam: a new deep-learning system - MSR

Members of the team that worked on the asynchronous DNN project: (from left) Karthik Kalyanaraman, Trishul Chilimbi, Johnson Apacible, Yutaka Suzue


Project Adam is a new deep-learning system modeled after the human brain that has greater image classification accuracy and is 50 times faster than other systems in the industry.

Project Adam, an initiative by Microsoft researchers and engineers, aims to demonstrate that large-scale, commodity distributed systems can train huge deep neural networks effectively. For proof, the researchers created the world’s best photograph classifier, using 14 million images from ImageNet, an image database divided into 22,000 categories.

Included in the vast array of categories are some that pertain to dogs. Project Adam knows dogs. It can identify dogs in images. It can identify kinds of dogs. It can even identify particular breeds, such as whether a corgi is a Pembroke or a Cardigan.

Now, if this all sounds vaguely familiar, that’s because it is—vaguely. A couple of years ago, The New York Times wrote a story about Google using a network of 16,000 computers to teach itself to identify images of cats. That is a difficult task for computers, and it was an impressive achievement.

Project Adam is 50 times faster—and more than twice as accurate, as outlined in a paper currently under academic review. In addition, it is efficient, using 30 times fewer machines, and scalable, areas in which the Google effort fell short.

Read the full article/reproduced from Microsoft Research


Oracle Big Data SQL lines up Database with Hadoop, NoSQL frameworks



Hadoop continues to loom large in the world of big data, and that holds true with the unveiling of the next step in Oracle’s big data roadmap. Oracle’s latest big idea for big data aims to eliminate data silos with new software connecting the dots between the Oracle Database, Hadoop and NoSQL.



Read the original / reproduced from ZDNet

The Redwood Shores, Calif.-headquartered corporation introduced Oracle Big Data SQL, SQL-based software that streamlines data flows between the Oracle Database and the NoSQL and Hadoop frameworks.

The approach is touted to minimize data movement, which could translate into faster performance for crunching numbers while also reducing security risks for data in transit.

Big Data SQL promises to be able to query any and all kinds of structured and unstructured data. Oracle Database’s security and encryption features can also be blanketed over Hadoop and NoSQL data.

Beyond extending its enterprise governance credentials, Oracle connected plenty of dots within its own portfolio. Big Data SQL runs on Oracle’s Big Data Appliance and is set up to play well with the tech titan’s flagship Exadata database machine. The Big Data SQL engine also borrows other familiar portfolio elements, such as Exadata’s Smart Scan technology for local data queries.

The Big Data Appliance itself was built on top of the Cloudera Hadoop distribution, a collaboration which has been in the works for the last three years.

Neil Mendelson, vice president of big data and advanced analytics at Oracle, told ZDNet on Monday that enterprise customers are still facing the following three obstacles: managing integration and data silos, obtaining the right people with new skill sets or relying on existing in-house talent, and security.

“Over this period of time working with customers, they’re really hitting a number of challenges,” Mendelson posited. He observed much of what customers are doing today is experimental in nature, but they’re now ready to move on to the production stage.

Thus, Mendelson stressed, Big Data SQL is designed to let users issue a single query which can run against data in Hadoop and NoSQL – individually or in any combination thereof.

“Oracle has taken some of its intellectual property and moved it on to the Hadoop cluster, from a database perspective,” Mendelson explained.

In order to utilize Big Data SQL, Oracle Database 12c is required first. Production is slated to start in August/September, and pricing will be announced when Big Data SQL goes into general availability.

Also on Tuesday, the hardware and software giant was expected to ship a slew of security updates fixing more than 100 vulnerabilities across hundreds of versions of its products.

That is following a blog post on Monday penned by Oracle’s vice president of Java product management, Henrik Stahl, who aimed to clarify the future of Java support on Windows XP.

He dismissed claims that Oracle would hamper Java updates from being applied to systems running the older version of Windows or that Java wouldn’t work on XP altogether anymore.

Nevertheless, Stahl reiterated Oracle’s previous stance that users still running Windows XP should upgrade to an operating system currently supported.



First major redesign of Raspberry Pi unveiled - TheEngineer

A new version of the credit card-sized computer, the Raspberry Pi, is launched today adding extra sensors and connectors to the £20 device.


The new model, known as the B+, represents the first major redesign of the Raspberry Pi since its commercial launch and features four USB ports, enabling the computer to support extra devices without their own mains power connection.

The computer is designed and manufactured in the UK as a way of promoting computer science to young people.

But it has also been widely embraced by the wider amateur and professional engineering communities and used for projects from home-made drones to creating industrial PCs that can control hundreds of devices.

Read the original /reproduced from theengineer.co.uk

The Raspberry Pi Foundation, the non-profit group that produces the device, hopes the extra connections and sensors will enable users to create bigger projects.

Eben Upton, CEO of Raspberry Pi Trading, said in a statement: ‘We’ve been blown away by the projects that have been made possible through the original B boards and, with its new features, the B+ has massive potential to push the boundaries and drive further innovation.’

Source: Raspberry Pi/Element 14


The Raspberry Pi B+ is based on the same Broadcom BCM2835 Chipset and 512MB of RAM as the previous model.

It is powered by micro USB with AV connections through either HDMI or a new four-pole connector replacing the existing analogue audio and composite video ports.

The SD card slot has been replaced with a micro-SD, tidying up the board design and helping to protect the card from damage. The B+ board also now uses less power (600mA) than the Model B Board (750mA) when running.

It features a 40-pin extended GPIO, although the first 26 pins remain identical to the original Raspberry Pi Model B for 100% backward compatibility.

The Raspberry Pi Model B+ is available to buy today on the element14 Community.

Read the original / reproduced from theengineer.co.uk

'Melbourne Shuffle' secures data in the cloud - PHYS.org

Encryption might not be enough for all that data stored in the cloud. An analysis of usage patterns -- which files are accessed and when -- can give away secrets as well. Computer scientists at Brown have developed an algorithm to sweep away those digital footprints. It's a complicated series of dance-like moves they call the Melbourne Shuffle. Credit: Tamassia Lab / Brown University

That may sound like a dance move (and it is), but it's also a new data-protection algorithm developed by researchers at Brown University.

The computing version of the Melbourne Shuffle aims to hide patterns that may emerge as users access data on cloud servers. Patterns of access could provide important information about a dataset—information that users don’t necessarily want others to know—even if the data files themselves are encrypted.

“Encrypting data is an important security measure. However, privacy leaks can occur even when accessing encrypted data,” said Olga Ohrimenko, lead author of a paper describing the algorithm. “The objective of our work is to provide a higher level of privacy guarantees, beyond what encryption alone can achieve.”

The paper was presented this week at the International Colloquium on Automata, Languages, and Programming (ICALP 2014) in Copenhagen. Ohrimenko, who recently received her Ph.D. from Brown University and now works at Microsoft Research, co-authored the work with Roberto Tamassia and Eli Upfal, professors of computer science at Brown, and Michael Goodrich from the University of California–Irvine.

Cloud computing is increasing in popularity as more individuals use services like Google Drive and more companies outsource their data to companies like Amazon Web Services. As the amount of data on the cloud grows, so do concerns about keeping it secure. Most cloud service providers encrypt the data they store. Larger companies generally encrypt their own data before sending it to the cloud to protect it not only from hackers but also to keep cloud providers themselves from snooping around in it.

But while encryption renders data files unreadable, it can’t hide patterns of data access. Those patterns can be a serious security issue. For example, a service provider—or someone eavesdropping on that provider—might be able to figure out that after accessing files at certain locations on the cloud server, a company tends to come out with a negative earnings report the following week. Eavesdroppers may have no idea what’s in those particular files, but they know that it’s correlated to negative earnings. But that’s not the only potential security issue. “The pattern of accessing data could give away some information about what kind of computation we’re performing or what kind of program we’re running on the data,” said Tamassia, chair of the Department of Computer Science. Some programs have very particular ways in which they access data. By observing those patterns, someone might be able to deduce, for example, that a company seems to be running a program that processes bankruptcy proceedings.

The Melbourne Shuffle aims to hide those patterns by shuffling the location of data on cloud servers. Ohrimenko named it after a dance that originated in Australia, where she did her undergraduate work. “The contribution of our paper is specifically a novel data shuffling method that is provably secure and computationally more efficient than previous methods,” Ohrimenko said.

It works by pulling small chunks of data down from the cloud and placing them in a user’s local memory. Once the data is out of view of the server’s prying eyes, it’s rearranged—shuffled like a deck of cards—and then sent back to the cloud server. By doing this over and over with new blocks of data, the entirety of the data on the cloud is eventually shuffled.

The result is that data accessed in one spot today, may be in a different spot tomorrow. So even when a user accesses the same data over and over, that access pattern looks to the server or an eavesdropper to be essentially random. “What we do is we obfuscate the access pattern,” Tamassia said. “It becomes unfeasible for the cloud provider to figure out what the user is doing.”
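The shuffle described above can be illustrated with a short simulation. This is a minimal sketch only, assuming in-memory lists stand in for server storage and a trusted client buffer; it is not the provably secure Melbourne Shuffle from the paper, and the helper name `toy_shuffle`, the `chunk_size` parameter and the final rotation step are invented for illustration.

```python
import random

def toy_shuffle(server_blocks, chunk_size=4, rng=None):
    """Toy illustration of the idea behind data shuffling on a cloud server:
    pull small chunks into trusted local memory, permute them out of the
    server's view, and write them back so stored positions keep changing.
    (Not the provably secure Melbourne Shuffle itself.)"""
    rng = rng or random.Random()
    blocks = list(server_blocks)                    # simulated server storage
    for start in range(0, len(blocks), chunk_size):
        local = blocks[start:start + chunk_size]    # "download" a small chunk
        rng.shuffle(local)                          # rearrange locally
        blocks[start:start + chunk_size] = local    # "upload" it back
    # a final rotation mixes data across chunk boundaries as well
    offset = chunk_size // 2
    return blocks[offset:] + blocks[:offset]
```

Repeating passes like this over time means data accessed at one position today may sit elsewhere tomorrow, which is what makes the observed access pattern look essentially random to the server.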

The researchers envision deploying their shuffle algorithm through a software application or a hardware device that users keep at their location. It could also be deployed in the form of a tamper-proof chip controlled by the user and installed at the data center of the cloud provider.  However it’s deployed, the approach has the promise of lowering the cost of strong  security in an increasingly cloudy computer world.

Explore further: Cracks emerge in the cloud

Read more at: http://phys.org/news/2014-07-melbourne-shuffle-cloud.html#jCp

SQL, Open Source, Security Stand Out In Hadoop Ecosystem - Forbes



SQL is the gateway drug to enterprise adoption, say analysts at a recent developer conference. Hadoop Summit, a leading big data developer conference, saw the maturation of the Hadoop ecosystem. Hadoop is one of the necessary requirements for realizing the promise of Big Data's application growth in the enterprise, and key players have emerged among those most influential in this development.

Reproduced from/read the original text from Forbes

Hortonworks, an initiative born of Hadoop's original team at Yahoo, has taken bold steps to branch out as its own organization, battling other service providers like Cloudera, MapR, and Pivotal. To date, Hortonworks has done well to stick with its vision of delivering the benefits of an open source Hadoop to the enterprise, raising $100 million in funding and initiating an acquisitions strategy to secure its place in Hadoop's ecosystem. Yet Hortonworks remains in a horserace with others looking to monetize Hadoop, all racing to build effective and secure solutions that can manage today's advanced analytics needs.

Democratizing Big Data software with SQL

One way in which Hadoop’s framework is being improved for commercial use is through SQL enhancements, which can enable businesses to run low-cost Hadoop clusters leveraging the popular, high performance language. At Hadoop Summit in San Jose, CA this week, several SQL-compatible solutions have been unveiled and demonstrated by Hadoop vendors, including Actian and Splice Machine.

The last few months have been rife with developments toward bringing SQL to Hadoop, including Apache Drill and the completion of the Stinger Initiative. Led by Hortonworks, Stinger brings full interactive query capabilities to Hadoop via a new and improved Hive. The Apache Drill project, led by MapR, provides another query option and support for JSON files.

Despite Hortonworks' leadership role in the Stinger Initiative, its success speaks to the power and influence of open source within Hadoop's ecosystem. In just over a year, 145 developers across 45 organizations, including SAP, Microsoft, WANdisco, Google and Netflix, contributed to Stinger, improving its SQL query performance 100-fold across petabytes of data, according to Wikibon analyst Jeff Kelly.

Indeed, Microsoft is becoming an important partner for Hortonworks, which recently announced Windows Azure HDInsight's support of Hadoop 2.2 and Stinger Phase Two with query optimization and compression technology. From Kelly's research, these new solutions yield fortyfold improvements in query performance.

The goal for Microsoft is to first bring Hadoop to Windows, then the cloud, and onto Azure, says John Kreisa, VP of Strategic Marketing at Hortonworks. He adds that these new solutions allow for multiple workloads on a single Hadoop cluster.

Open source vs. proprietary

Making these capabilities available across the entire Hadoop ecosystem is a business matter, but also proof of the innovation that’s come about through open source, collaborative efforts. Yet, can such collaboration help Hortonworks compete against other Hadoop vendors?

Also flush with cash is Hortonworks rival Cloudera, which now has the backing of Intel and an astounding total of $740 million in funding in exchange for 18% equity in Cloudera. Pivotal is also playing hardball, pricing its Hadoop services aggressively and putting pressure on others in the marketplace. Analysts and customers privately tell me that Pivotal is winning at the expense of building a viable developer community, the risk a company takes in trying to grow fast. MapR has reported strong growth and customer success across multiple verticals, including large financial services, retail and telco deployments.

Pivotal keeps pumping out products. Its recently launched Big Data Suite extends a plethora of perks to simplify Hadoop's application to the enterprise, making it easier to deploy across an unlimited number of nodes and to leverage Pivotal HD and its support services without extra fees. So even though some customers will choose Hortonworks' vendor-neutral approach to Hadoop deployments, Pivotal is upping the ante, forcing its rivals to differentiate "either through vastly superior support services or proprietary software" to justify pricier deployment models, writes Kelly.

The opportunity in securing Hadoop

Security is also an opportunity for Hadoop vendors to appeal to the enterprise, and recent developments indicate growing interest in properly securing data access for businesses eager to apply advanced analytics techniques.

This is where Hortonworks’ acquisition strategy comes into play, emphasizing the importance of security in Big Data. Celebrated as Hortonworks’ first acquisition, XA Secure could prove key to Hortonworks’ immediate success. Specializing in Hadoop security, Hortonworks plans to open source the XA Secure software, submit it to Apache and add it to its own Hortonworks Data Platform. Among other things, XA Secure provides a centralized admin console to manage security in Hadoop.

Hortonworks is also winning the support of strategic partners towards securing Hadoop, as Zettaset announced its support for Hortonworks’ primary Big Data platform HDP 2.1 at this week’s Hadoop Summit. A Hadoop security provider, Zettaset’s integration with Hortonworks is another step towards simplifying enterprise adoption for Hadoop products.

Cloudera is also trying its hand at securing Apache for Hadoop, enabling column-level permissions and access control, as well as group- and role-based authentication. Sqrrl, Accumulo and Cloudera’s newest investor Intel are all seeking ways to improve Hadoop security, a necessity when dealing with large volumes of potentially sensitive data. Not only does Hadoop’s perimeter need to be secured, but access granted to the right people, and enterprises must remain compliant with their data management processes.

Hadoop is going mainstream, and 100% of the enterprises I spoke to said "Hadoop will replace their legacy enterprise data warehouse (EDW) products".

Reproduced from/read the original text from Forbes

The Next Big Programming Language You’ve Never Heard Of - Wired

(c) Getty

Andrei Alexandrescu didn’t stand much of a chance. And neither did Walter Bright.

When the two men met for beers at a Seattle bar in 2005, each was in the midst of building a new programming language, trying to remake the way the world creates and runs its computer software. That’s something pretty close to a hopeless task, as Bright knew all too well. “Most languages never go anywhere,” he told Alexandrescu that night. “Your language may have interesting ideas. But it’s never going to succeed.”

Read the original article/reproduced from Wired

Alexandrescu, a graduate student at the time, could've said the same thing to Bright, an engineer who had left the venerable software maker Symantec a few years earlier. People are constantly creating new programming languages, but because the software world is already saturated with so many of them, the new ones rarely get used by more than a handful of coders, especially if they're built by an ex-Symantec engineer without the backing of a big-name outfit. But Bright's new language, known as D, was much further along than the one Alexandrescu was working on, dubbed Enki, and Bright said they'd both be better off if Alexandrescu dumped Enki and rolled his ideas into D. Alexandrescu didn't much like D, but he agreed. "I think it was the beer," he now says.

The result is a programming language that just might defy the odds. Nine years after that night in Seattle, a $200-million startup has used D to build its entire online operation, and thanks to Alexandrescu, one of the biggest names on the internet is now exploring the new language as well. Today, Alexandrescu is a research scientist at Facebook, where he and a team of coders are using D to refashion small parts of the company's massive operation. Bright, too, has collaborated with Facebook on this experimental software, as an outside contractor. The tech giant isn't an official sponsor of the language (something Alexandrescu is quick to tell you), but Facebook believes in D enough to keep him working on it full-time, and the company is at least considering the possibility of using D in lieu of C++, the venerable language that drives the systems at the heart of so many leading web services.

C++ is an extremely fast language—meaning software built with it runs at high speed—and it provides great control over your code. But it’s not as easy to use as languages like Python, Ruby, and PHP. In other words, it doesn’t let coders build software as quickly. D seeks to bridge that gap, offering the performance of C++ while making things more convenient for programmers.

Among the giants of tech, this is an increasingly common goal. Google’s Go programming language aims for a similar balance of power and simplicity, as does the Swift language that Apple recently unveiled. In the past, the programming world was split in two: the fast languages and the simpler modern languages. But now, these two worlds are coming together. “D is similar to C++, but better,” says Brad Anderson, a longtime C++ programmer from Utah who has been using D as well. “It’s high performance, but it’s expressive. You can get a lot done without very much code.”

Continue reading the original article / reproduced from Wired

New guidelines to help EU businesses use the Cloud - European Commission IP/14/743, 26/06/2014



New guidelines to help EU businesses use the Cloud

Guidelines to help business users save money and get the most out of cloud computing services are being presented to the European Commission today. Cloud computing allows individuals, businesses and the public sector to store their data and carry out data processing in remote data centres, saving on average 10-20%.

The guidelines have been developed by a Cloud Select Industry Group as part of the Commission's European Cloud Strategy to increase trust in these services. Contributors to the guidelines include Arthur's Legal, ATOS, Cloud Security Alliance, ENISA, IBM, Microsoft, SAP and Telecom Italia (complete member list here).

Original source at Europa.eu

Today’s announcement is a first step towards standardised building blocks for Service Level Agreements (SLAs) terminology and metrics. An SLA is a part of a service contract that defines the technical and legal aspects of the service offered. The recent findings of the Trusted Cloud Europe survey show SLA standards are very much required by cloud users.

These guidelines will help professional cloud users ensure essential elements are included in plain language in contracts they make with cloud providers. Relevant items include:

  • The availability and reliability of the cloud service;

  • The quality of support services they will receive from their cloud provider;

  • Security levels;

  • How to better manage the data they keep in the cloud.

European Commission Vice-President @NeelieKroesEU said: “This is the first time cloud suppliers have agreed on common guidelines for service level agreements. I think small businesses in particular will benefit from having these guidelines at hand when searching for cloud services.”

Vice-President Viviane Reding said: “Today’s new guidelines will help generate trust in innovative computing solutions and help EU citizens save money. More trust means more revenue for companies in Europe’s digital single market.” She added: “This is the same spirit as the EU data protection reform which aims at boosting trust. A competitive digital single market needs high standards of data protection. EU consumers and small firms want safe and fair contract terms. Today’s new guidelines are a step in the right direction.”

As a next step, the European Commission will test these guidelines with users, in particular SMEs. The guidelines will also be discussed within the Expert Group on Cloud Computing Contracts set up by the Commission in October 2013. This discussion will also involve other C-SIG activities, for example the data protection Code of Conduct for cloud computing providers that was prepared by the C-SIG on Code of Conduct. The draft Code of Conduct has been presented to the Article 29 Data Protection Working Party (European Data Protection Authorities).

This initiative will have deeper impact if standardisation of SLAs is done at international level, e.g. through international standards, such as ISO/IEC 19086. To this end, the C-SIG on SLAs is also working with the ISO Cloud Computing Working Group, to present a European position on SLA Standardisation. Today’s SLA guidelines will thus feed into ISO’s effort to establish international standards on SLAs for cloud computing.


Internet service providers commonly include SLAs in contracts with customers to define the levels of service being sold. SLAs form an important component of the contractual relationship between a customer and a provider of a cloud service. Given the global nature of the cloud, cloud contracts often span different jurisdictions, with varying applicable legal requirements, in particular with respect to the protection of personal data hosted in the cloud. Also different cloud services and deployment models will require different approaches to SLAs, adding to the complexity.

Under its second key action, safe and fair contract terms and conditions, the European Cloud Computing Strategy called for work on model terms for cloud computing service level agreements for contracts between cloud providers and professional cloud users. The C-SIG on SLAs was convened to address this provision. The Strategy also called for identifying safe and fair contract terms for contracts between cloud suppliers and consumers and small firms. For this purpose the Commission created its Expert Group on Cloud Computing Contracts.

Largest collection of FREE Microsoft eBooks ever - MSDN Blogs

FREE Microsoft eBooks! Who doesn't love FREE Microsoft eBooks? Well, for the past few years, I've provided posts containing almost 150 FREE Microsoft eBooks and my readers, new and existing, have loved these posts so much that they downloaded over 3.5 million free eBooks as of last June, including over 1,000,000 in a single week last year (and many, many more since then).

Eric Ligman, Microsoft Senior Sales Excellence Manager

Original post at MSDN Blogs

Data Science Workflow: Overview and Challenges - CACM

Data Science Venn Diagram

Data science is the study of the generalizable extraction of knowledge from data, yet the key word is science. It incorporates varying elements and builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. Data Science is not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science.

A practitioner of data science is called a data scientist. Data scientists solve complex data problems by employing deep expertise in some scientific discipline. It is generally expected that data scientists are able to work with various elements of mathematics, statistics and computer science, although expertise in these subjects is not required.[3] However, a data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three. This means that data science must be practiced as a team, where across the membership of the team there is expertise and proficiency across all the disciplines.

Good data scientists are able to apply their skills to achieve a broad spectrum of end results. Some of these include the ability to find and interpret rich data sources, manage large amounts of data despite hardware, software and bandwidth constraints, merge data sources together, ensure consistency of data-sets, create visualizations to aid in understanding data, build mathematical models using the data, present and communicate the data insights/findings to specialists and scientists in their team and if required to a non-expert audience. The skill-sets and competencies that data scientists employ vary widely.

Data science techniques impact how we access data and conduct research across various domains, including the biological sciences, medical informatics, health care, social sciences and the humanities. Likewise data science heavily influences economics, business and finance. From the business perspective, data science is an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and data analysis, that can help businesses gain a competitive edge. – WIKIPEDIA

Data Science Workflow: Overview and Challenges – CACM

By Philip Guo

During my Ph.D., I created tools for people who write programs to obtain insights from data. Millions of professionals in fields ranging from science, engineering, business, finance, public policy, and journalism, as well as numerous students and computer hobbyists, all perform this sort of programming on a daily basis.

Read the original article from CACM

Shortly after I wrote my dissertation in 2012, the term Data Science started appearing everywhere. Some industry pundits call data science the “sexiest job of the 21st century.” And universities are putting tremendous funding into new Data Science Institutes.

I now realize that data scientists are one main target audience for the tools that I created throughout my Ph.D. However, that job title was not as prominent back when I was in grad school, so I didn’t mention it explicitly in my dissertation.

What do data scientists do at work, and what challenges do they face?

This post provides an overview of the modern data science workflow, adapted from Chapter 2 of my Ph.D. dissertation, Software Tools to Facilitate Research Programming.

The Data Science Workflow

The figure below shows the steps involved in a typical data science workflow. There are four main phases, shown in the dotted-line boxes: preparation of the data, alternating between running the analysis and reflection to interpret the outputs, and finally dissemination of results in the form of written reports and/or executable code.

Preparation Phase

Before any analysis can be done, the programmer (data scientist) must first acquire the data and then reformat it into a form that is amenable to computation.

Acquire data: The obvious first step in any data science workflow is to acquire the data to analyze. Data can be acquired from a variety of sources, for example:

  • Data files can be downloaded from online repositories such as public websites (e.g., U.S. Census data sets).
  • Data can be streamed on-demand from online sources via an API (e.g., the Bloomberg financial data stream).
  • Data can be automatically generated by physical apparatus, such as scientific lab equipment attached to computers.
  • Data can be generated by computer software, such as logs from a webserver or classifications produced by a machine learning algorithm.
  • Data can be manually entered into a spreadsheet or text file by a human.

The main problem that programmers face in data acquisition is keeping track of provenance, i.e., where each piece of data comes from and whether it is still up-to-date.  It is important to accurately track provenance, since data often needs to be re-acquired in the future to run updated experiments.  Re-acquisition can occur either when the original data sources get updated or when researchers want to test alternate hypotheses.  Also, provenance can enable downstream analysis errors to be traced back to the original data sources.

Data management is a related problem: Programmers must assign names to data files that they create or download and then organize those files into directories.  When they create or download new versions of those files, they must make sure to assign proper filenames to all versions and keep track of their differences.  For instance, scientific lab equipment can generate hundreds or thousands of data files that scientists must name and organize before running computational analyses on them.

A secondary problem in data acquisition is storage: Sometimes there is so much data that it cannot fit on a single hard drive, so it must be stored on remote servers.  However, anecdotes and empirical studies indicate that a significant amount of data analysis is still done on desktop machines with data sets that fit on modern hard drives (i.e., less than a terabyte).

Reformat and clean data: Raw data is probably not in a convenient format for a programmer to run a particular analysis, often due to the simple reason that it was formatted by somebody else without that programmer’s analysis in mind.  A related problem is that raw data often contains semantic errors, missing entries, or inconsistent formatting, so it needs to be “cleaned” prior to analysis.

Programmers reformat and clean data either by writing scripts or by manually editing data in, say, a spreadsheet.  Many of the scientists I interviewed for my dissertation work complained that these tasks are the most tedious and time-consuming parts of their workflow, since they are unavoidable chores that yield no new insights. However, the chore of data reformatting and cleaning can lend insights into what assumptions are safe to make about the data, what idiosyncrasies exist in the collection process, and what models and analyses are appropriate to apply.
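A cleaning script in this spirit might trim stray whitespace, normalize inconsistent formatting, and flag missing entries for manual review. The column names and rules below are invented for illustration:

```python
import csv
import io

def clean_rows(raw_csv_text, missing_marker=""):
    """Minimal cleaning sketch: strips whitespace from headers and values,
    lower-cases a hypothetical 'status' column, and records the indices of
    rows with missing entries so they can be inspected by hand."""
    rows, problems = [], []
    for i, row in enumerate(csv.DictReader(io.StringIO(raw_csv_text))):
        row = {k.strip(): (v or "").strip() for k, v in row.items()}
        row["status"] = row.get("status", "").lower()   # normalize formatting
        if missing_marker in row.values():
            problems.append(i)                          # needs manual attention
        rows.append(row)
    return rows, problems
```

Even a throwaway script like this doubles as documentation of what idiosyncrasies were found in the raw data.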

Data integration is a related challenge in this phase.  For example, Christian Bird, an empirical software engineering researcher that I interviewed at Microsoft Research, obtains raw data from a variety of .csv and XML files, queries to software version control systems and bug databases, and features parsed from an email corpus.  He integrates all of these data sources together into a central MySQL relational database, which serves as the master data source for his analyses.
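A scaled-down sketch of that integration pattern, using SQLite so the example is self-contained (the researcher in the text used MySQL; the helper name and table layout are invented, and a real pipeline would also validate column names before interpolating them into SQL):

```python
import csv
import io
import sqlite3

def integrate(csv_sources):
    """Load several CSV sources into one central relational database that can
    serve as the master data source for later analyses. csv_sources maps a
    table name to that table's CSV text."""
    db = sqlite3.connect(":memory:")
    for table, text in csv_sources.items():
        reader = csv.reader(io.StringIO(text))
        header = next(reader)
        db.execute(f"CREATE TABLE {table} ({', '.join(header)})")
        placeholders = ", ".join("?" for _ in header)
        db.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    return db
```

Once everything lives in one database, cross-source questions become single SQL queries instead of ad-hoc file joins.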

In closing, the following excerpt from the introduction of the book Python Scripting for Computational Science summarizes the extent of data preparation chores:

Scientific Computing Is More Than Number Crunching: Many computational scientists work with their own numerical software development and realize that much of the work is not only writing computationally intensive number-crunching loops.  Very often programming is about shuffling data in and out of different tools, converting one data format to another, extracting numerical data from a text, and administering numerical experiments involving a large number of data files and directories.  Such tasks are much faster to accomplish in a language like Python than in Fortran, C, C++, C#, or Java.

In sum, data munging and organization are human productivity bottlenecks that must be overcome before actual substantive analysis can be done.

Analysis Phase

The core activity of data science is the analysis phase: writing, executing, and refining computer programs to analyze and obtain insights from data.  I will refer to these kinds of programs as data analysis scripts, since data scientists often prefer to use interpreted “scripting” languages such as Python, Perl, R, and MATLAB. However, they also use compiled languages such as C, C++, and Fortran when appropriate.

The figure below shows that in the analysis phase, the programmer engages in a repeated iteration cycle of editing scripts, executing to produce output files, inspecting the output files to gain insights and discover mistakes, debugging, and re-editing.

The faster the programmer can make it through each iteration, the more insights can potentially be obtained per unit time.  There are three main sources of slowdowns:

  • Absolute running times: Scripts might take a long time to terminate, either due to large amounts of data being processed or the algorithms being slow, which could itself be due to asymptotic “Big-O” slowness and/or the implementations being slow.
  • Incremental running times: Scripts might take a long time to terminate after minor incremental code edits done while iterating on analyses, which wastes time re-computing almost the same results as previous runs.
  • Crashes from errors: Scripts might crash prematurely due to errors in either the code or inconsistencies in data sets. Programmers often need to endure several rounds of debugging and fixing banal bugs such as data parsing errors before their scripts can terminate with useful results.
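The second slowdown, re-computing nearly identical results after minor code edits, is commonly attacked by memoizing each analysis step on a hash of its inputs. A minimal sketch, with a hypothetical helper name and an in-memory cache standing in for the on-disk cache a real tool would use:

```python
import hashlib
import json

_cache = {}

def cached_run(step_name, inputs, func):
    """Run an analysis step only if this (step, inputs) pair has not been
    computed before; otherwise return the cached result. Hashing the inputs
    means unchanged steps are skipped on re-runs after minor edits."""
    digest = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    key = (step_name, digest)
    if key not in _cache:
        _cache[key] = func(inputs)   # only recompute when inputs changed
    return _cache[key]
```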

File and metadata management is another challenge in the analysis phase. Repeatedly editing and executing scripts while iterating on experiments causes the production of numerous output files, such as intermediate data, textual reports, tables, and graphical visualizations.  For example, the figure below shows a directory listing from a computational biologist's machine that contains hundreds of PNG output image files, each with a long and cryptic filename.  To track provenance, data scientists often encode metadata such as version numbers, script parameter values, and even short notes into their output filenames.  This habit is prevalent since it is the easiest way to ensure that metadata stays attached to the file and remains highly visible.  However, doing so leads to data management problems due to the abundance of files and the fact that programmers often later forget their own ad-hoc naming conventions.  The following email snippet from a Ph.D. student in bioinformatics summarizes these sorts of data management woes:

Often, you really don’t know what’ll work, so you try a program with a combination of parameters, and a combination of input files.  And so you end up with a massive proliferation of output files.  You have to remember to name the files differently, or write out the parameters every time.  Plus, you’re constantly tweaking the program, so the older runs may not even record the parameters that you put in later for greater control.  Going back to something I did just three months ago, I often find out I have absolutely no idea what the output files mean, and end up having to repeat it to figure it out.

Lastly, data scientists do not write code in a vacuum: As they iterate on their scripts, they often consult resources such as documentation websites, API usage examples, sample code snippets from online forums, PDF documents of related research papers, and relevant code obtained from colleagues.

Reflection Phase

Data scientists frequently alternate between the analysis and reflection phases while they work, as denoted by the arrows between the two respective phases in the figure below:

Whereas the analysis phase involves programming, the reflection phase involves thinking and communicating about the outputs of analyses. After inspecting a set of output files, a data scientist might perform the following types of reflection:

Take notes: People take notes throughout their experiments in both physical and digital formats.  Physical notes are usually written in a lab notebook, on sheets of looseleaf paper, or on a whiteboard. Digital notes are usually written in plain text files, “sticky notes” desktop widgets, Microsoft PowerPoint documents for multimedia content, or specialized electronic notetaking applications such as Evernote or Microsoft OneNote.  Each format has its advantages: It is often easier to draw freehand sketches and equations on paper, while it is easier to copy-and-paste programming commands and digital images into electronic notes.  Since notes are a form of data, the usual data management problems arise in notetaking, most notably how to organize notes and link them with the context in which they were originally written.

Hold meetings: People meet with colleagues to discuss results and to plan next steps in their analyses.  For example, a computational science Ph.D. student might meet with her research advisor every week to show the latest graphs generated by her analysis scripts. The inputs to meetings include printouts of data visualizations and status reports, which form the basis for discussion.  The outputs of meetings are new to-do list items for meeting attendees.  For example, during a summer internship at Microsoft Research working on a data-driven study of what factors cause software bugs to be fixed, I had daily meetings with my supervisor, Tom Zimmermann.  Upon inspecting the charts and tables that my analyses generated each day, he often asked me to adjust my scripts or to fork my analyses to explore multiple alternative hypotheses (e.g., “Please explore the effects of employee location on bug fix rates by re-running your analysis separately for each country.”).

Make comparisons and explore alternatives: The reflection activities that tie most closely with the analysis phase are making comparisons between output variants and then exploring alternatives by adjusting script code and/or execution parameters.  Data scientists often open several output graph files side-by-side on their monitors to visually compare and contrast their characteristics. Diana MacLean observed the following behavior in her shadowing of scientists at Harvard:

Much of the analysis process is trial-and-error: a scientist will run tests, graph the output, rerun them, graph the output, etc.  The scientists rely heavily on graphs — they graph the output and distributions of their tests, they graph the sequenced genomes next to other, existing sequences.

The figure below shows an example set of graphs from social network analysis research, where four variants of a model algorithm are tested on four different input data sets:

This example is the final result from a published paper and Ph.D. dissertation; during the course of running analyses, many more of these types of graphs are produced by analysis scripts. Data scientists must organize, manage, and compare these graphs to gain insights and ideas for what alternative hypotheses to explore.
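A fixed directory scheme is one low-tech way to keep such a grid of variant-by-dataset outputs organized and comparable. The sketch below uses made-up variant and data set names purely to illustrate the layout; `run_model` is a stand-in for a real analysis:

```python
import os

# Hypothetical names: four model variants run against four data sets,
# mirroring the grid of graphs described above.
variants = ["baseline", "decay", "burst", "hybrid"]
datasets = ["flickr", "delicious", "answers", "linkedin"]

def run_model(variant, dataset):
    # Stand-in for a real analysis; returns a fake score.
    return float(len(variant) * len(dataset))

for v in variants:
    for d in datasets:
        outdir = os.path.join("results", v, d)
        os.makedirs(outdir, exist_ok=True)
        with open(os.path.join(outdir, "score.txt"), "w") as f:
            f.write(f"{run_model(v, d)}\n")

# A fixed scheme (results/<variant>/<dataset>/) replaces ad-hoc filename
# encodings and makes side-by-side comparison scriptable.
```

With this layout, gathering all results for one variant, or one data set, is a single glob rather than an exercise in decoding filenames.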

Dissemination Phase

The final phase of data science is disseminating results, most commonly in the form of written reports such as internal memos, slideshow presentations, business/policy white papers, or academic research publications. The main challenge here is how to consolidate all of the various notes, freehand sketches, emails, scripts, and output data files created throughout an experiment to aid in writing.

Beyond presenting results in written form, some data scientists also want to distribute their software so that colleagues can reproduce their experiments or play with their prototype systems.  For example, computer graphics and user interface researchers currently submit a video screencast demo of their prototype systems along with each paper submission, but it would be ideal if paper reviewers could actually execute their software to get a “feel” for the techniques being presented in each paper.  In reality, it is difficult to distribute research code in a form that other people can easily execute on their own computers.  Before colleagues can execute one’s code (even on the same operating system), they must first obtain, install, and configure compatible versions of the appropriate software and their myriad of dependent libraries, which is often a frustrating and error-prone process.  If even one portion of one dependency cannot be fulfilled, then the original code will not be re-executable.

Similarly, it is even difficult to reproduce the results of one’s own experiments a few months or years in the future, since one’s own operating system and software inevitably get upgraded in some incompatible manner such that the original code no longer runs.  For instance, academic researchers need to be able to reproduce their own results in the future after submitting a paper for review, since reviewers inevitably suggest revisions that require experiments to be re-run.  As an extreme example, my former officemate Cristian Cadar used to archive his experiments by removing the hard drive from his computer after submitting an important paper to ensure that he can re-insert the hard drive months later and reproduce his original results.
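Short of archiving an entire hard drive, a lighter-weight habit is to snapshot the software environment next to an experiment's results at submission time. The sketch below assumes pip is available and uses a hypothetical helper name; it records interpreter, platform, and package versions so a future re-run can at least reconstruct what the original setup looked like:

```python
import json
import platform
import subprocess
import sys

def snapshot_environment(path="environment.json"):
    """Record interpreter and package versions next to an experiment's
    results, so a future re-run can reconstruct the original setup."""
    try:
        frozen = subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
    except (subprocess.CalledProcessError, FileNotFoundError):
        frozen = []  # pip unavailable; record what we can
    snapshot = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": frozen,
    }
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
    return snapshot

snap = snapshot_environment()
```

This does not guarantee re-executability the way a full system image would, but it turns "my code mysteriously stopped running" into a diff between two version lists.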

Lastly, data scientists often collaborate with colleagues by sending them partial results to receive feedback and fresh ideas. A variety of logistical and communication challenges arise in collaborations centered on code and data sharing.

Need expertise with cloud / internet scale computing / data science / machine learning / mapreduce / hadoop etc.? Contact me - I can help! - these are core expertise areas.

IBM Brings Big Data Analysis Services on IBM Cloud Marketplace - Cloud Times



Having been the first to understand that services would yield more than the sale of equipment, IBM now anticipates that software sales will in turn bring in more than services. Continuing that effort, IBM has launched a new big data service designed to deliver better results and decision-making across the enterprise. The new service, available within the IBM Cloud marketplace, can be used from any web browser on any desktop or mobile device, and in the context of key business processes. It combines enterprise-grade security, governance and integration with mobile and web apps that are easy to interact with and use.


One of the offerings, IBM Navigator on Cloud, now available on the IBM Cloud marketplace and built on SoftLayer’s cloud platform, will help users both within and outside the business gain value from this ever-increasing deluge of content.

According to research estimates, 2.5 billion gigabytes of data are created every day, and 80 per cent of that data is unstructured content such as contracts, claims forms, and permit applications. The IBM Cloud marketplace thus serves as a single place to learn about, implement and consume services for this structured and unstructured data, spanning marketing, purchasing, sales and trade, supply chain, customer service, finance, legal and functional services for smart cities.

IBM says that because the service is built on Enterprise Content Management (ECM) capabilities proven in regulated environments, it delivers the ease of use associated with consumer-oriented content management and file-sharing applications without sacrificing security. Moreover, SoftLayer gives customers the ability to choose a cloud environment and a location that suit their business needs, with visibility and transparency into where the data resides.

For example, a business DevOps team seeking a better way to develop new technologies for dynamic business needs would, until now, have scoured the web for sites and providers, collecting fragments to assemble a solution. Today the team can consolidate evaluation, immediate testing and purchasing of applications from IBM in the IBM Cloud marketplace. Likewise, a maintenance worker in the field can use a mobile device to pull up the latest schematics for a piece of equipment, take photos of a damaged part, update a safety document based on a repair, and synchronize this content back to the cloud, making it instantly available to colleagues on desktops or mobile devices.

Cloud computing and big data represent one of IBM's core businesses. The market for data and analytics is estimated to reach $187bn by 2015. To capture this growth potential, IBM has invested more than $24bn, including $17bn of gross spend on more than 30 acquisitions. Two-thirds of IBM Research’s work is now devoted to data, analytics and cognitive computing.

The other component technology driving this new relationship is mobility. IBM supplies mobility solutions through MobileFirst, an integrated strategy to help companies leverage the opportunities offered by mobile technology.


Cloud computing: facilitating cutting edge collaborative research - CORDIS (EU)


Cloud computing – where storage facilities are provided on demand over the internet from shared data centres – enables effective research collaboration to blossom. Rather than having to purchase a cluster of computers or struggle to find space at the lab, researchers can outsource their computing storage needs to remote facilities in the cloud and make this data accessible to colleagues.

In order to facilitate closer research collaboration, an EU-funded project entitled HELIX NEBULA (HNX) has created an online platform where customers can choose between various cloud service suppliers. The ultimate objective of the project, which was completed in May 2014, is to enable researchers and scientists to buy, use and manage cloud services as seamlessly as possible.

The team behind the project believes that cloud-based services could become a billion-euro business in the near future, helping researchers make savings of up to 40 % in infrastructure costs. Indeed, the project, which began in June 2012 with EUR 1.8 million in EU funding, anticipated that data capture, processing, and storage – crucial to scientific endeavour – were being overtaken by the demand for greater efficiency, speedier results and the increasing need for greater international collaboration.

Cloud-based services were identified as a viable solution, as they offer greater efficiency and agility in delivering services through economies of scale. A key advantage of cloud computing is its elasticity; storage space for example can be scaled up quickly depending on a research team’s needs.

One example of how cloud computing can benefit collaborative research projects is the work currently being carried out at the Large Hadron Collider at CERN in Geneva. Detectors there are searching for new discoveries in the collisions of protons of extraordinarily high energy, which could tell us more about how our universe was created and shaped. These experiments are currently running a large scale distributed computing system to process the massive amounts of data collected.

‘CERN’s computing capacity needs to keep up with the enormous amount of data coming from the Large Hadron Collider and we see Helix Nebula as a great way of working with industry to meet this challenge,’ said Frédéric Hemmer, head of CERN’s IT department.

HNX is now open to cloud providers capable of participating competitively in line with European regulations and with a suitable quality of service. Commercial cloud providers from a number of EU Member States have already joined the Helix Nebula initiative, and declared their interest in offering services via HNX. Cloud services will be offered to the global research community, for both publicly-funded and commercial organisations across a diverse range of sectors including healthcare, oil and gas, high-tech and manufacturing.

The HELIX NEBULA project is seen as a preliminary step towards establishing a pan-European cloud-based scientific e-infrastructure. Indeed, the project consortium now intends to build on the successful development of this platform to provide users with easy access to a wide range of services, including digital infrastructure, tools, information and applications.

In effect, the HNX is set to become a digital hub for researchers and scientists across Europe and beyond, encouraging the sharing of knowledge and the establishment of new virtual partnerships.

For more information, please visit:

Project factsheet

Category: Projects
Data Source Provider: HELIX NEBULA
Document Reference: Based on a press release provided by HELIX NEBULA
Subject Index: Information and communication technology applications

RCN: 36622

Hadoop for BigData on Azure - Free Book and Tutorial Video

Free ebook: Introducing Microsoft Azure HDInsight

I cannot deny that I am an avid (understatement?) Hadoop and BigData enthusiast :) – so here goes a free eBook download, “Introducing Microsoft Azure HDInsight”, by Avkash Chauhan, Valentine Fontama, Michele Hart, Wee Hyong Tok, and Buck Woody, as well as a great tutorial on Big Data, Hadoop, and Microsoft’s new Hadoop-based service called Windows Azure HDInsight.

Microsoft Azure HDInsight is Microsoft’s 100 percent compliant distribution of Apache Hadoop on Microsoft Azure. This means that standard Hadoop concepts and technologies apply, so learning the Hadoop stack helps you learn the HDInsight service. At the time of this writing, HDInsight (version 3.0) uses Hadoop version 2.2 and Hortonworks Data Platform 2.0.

Download the PDF (6.37 MB; 130 pages) from http://aka.ms/IntroHDInsight/PDF

Download the EPUB (8.46 MB) from http://aka.ms/IntroHDInsight/EPUB

Download the MOBI (12.8 MB) from http://aka.ms/IntroHDInsight/MOBI

Download the code samples (6.83 KB) from http://aka.ms/IntroHDInsight/CompContent


The following video is a general introduction to Big Data, Hadoop, and Microsoft’s new Hadoop-based service called Windows Azure HDInsight. The presentation is divided into two videos; this is Part 1, covering Big Data and Hadoop. The relevant blog post is: Let there be Windows Azure HDInsight. The up-to-date presentation is on GitHub at: http://bit.ly/ZY6DzN


Why DevOps is Key to Software Success - New Relic

This eBook discusses the origins of DevOps and the benefits of embracing a DevOps philosophy to help us solve both our current and future problems.

Gigaom examines the challenges and opportunities of embracing DevOps. Fact: rapid communication and agile collaboration between your development and operations teams is more crucial to business success than ever before. The reason is simple: as software touches nearly every moment of every day, modern application users increasingly demand software that is stable, fast, and constantly updated with new features. The best way to consistently deliver feature-rich software with speed and stability is DevOps.

Download EBook

Wikipedia : DevOps (a portmanteau of development and operations) is a software development method that stresses communication, collaboration and integration between software developers and information technology (IT) operations professionals.[1][2] DevOps is a response to the interdependence of software development and IT operations. It aims to help an organization rapidly produce software products and services.[3][4][5][6][7]

Simple processes become clearly articulated using a DevOps approach. The goal is to maximize the predictability, efficiency, security and maintainability of operational processes. This objective is very often supported by automation.

DevOps integration targets product delivery, quality testing, feature development and maintenance releases in order to improve reliability and security and faster development and deployment cycles. Many of the ideas (and people) involved in DevOps came from the Enterprise Systems Management and Agile software development movements.[8]

DevOps aids software application release management for a company by standardizing development environments. Events can be tracked more easily, and documented process control and granular reporting issues can be resolved. Companies with release/deployment automation problems usually have existing automation but want to manage and drive it more flexibly, without entering everything manually at the command line. Ideally, this automation can be invoked by non-operations staff in specific non-production environments. Developers gain more control over their environments, and infrastructure staff gain a more application-centric understanding.

Companies with very frequent releases may require a DevOps awareness or orientation program. Flickr developed a DevOps approach to support a business requirement of ten deployments per day;[9] this daily deployment cycle would be much higher at organizations producing multi-focus or multi-function applications. This is referred to as continuous deployment[10] or continuous delivery [11] and is frequently associated with the lean startup methodology.[12] Working groups, professional associations and blogs have formed on the topic since 2009.[6][13][14]

Magic Quadrant for Cloud Infrastructure as a Service - Gartner

The market for cloud compute infrastructure as a service (a virtual data center of compute, storage and network resources delivered as a service) is still maturing and rapidly evolving. Strategic providers must therefore be chosen carefully.

Read the full article / reproduced from Gartner.

Magic Quadrant for Cloud Infrastructure as a Service – Source: Gartner (May 2014)

What Types of Workload Are Being Placed on Cloud IaaS?

There are three broad categories of customer needs in cloud IaaS:

  • The hosting of a single application, or a closely related group of applications
  • A VDC that will serve a broad range of different workloads
  • Batch computing

Hosting is the most common need. For instance, a media company with a marketing microsite for a movie, a software company offering SaaS and a retailer needing a lightweight version of its e-commerce site for disaster-recovery purposes are examples of customers with hosting needs that can be fulfilled by IaaS. These are generally production applications, although there is some test and development as well. Some of these customers have mission-critical needs, while others do not.


Customers with a broad range of unrelated workloads are less common, but are growing in importance, particularly in the midmarket, where IaaS is gradually replacing or supplementing traditional data center infrastructure. The VDC is typically used very similarly to the organization’s internal virtualization environment — primarily for less mission-critical production applications, or test and development environments — but is increasingly being used to run more mission-critical applications.

The least common need, but one that nevertheless generates significant revenue for the small number of providers that serve this portion of the market, is batch computing. For these customers, IaaS serves as a substitute for traditional HPC or grid computing. Customer needs include rendering, video encoding, genetic sequencing, modeling and simulation, numerical analysis and data analytics. Other than the need to access large amounts of commodity compute at the lowest possible price, with little concern for infrastructure reliability, these customers typically have needs very similar to those of VDC customers, although some HPC use cases benefit from specialized hardware such as GPUs and high-speed interconnects.

Cloud IaaS can now be used to run most workloads, although not every provider can run every type of workload well. Service providers are moving toward infrastructure platforms that can offer physical (nonvirtualized) and virtual resources, priced according to the level of availability, performance, security and isolation that the customer selects. This allows customers to run both “cloud native” applications that have been architected with cloud transaction processing principles in mind (see “From OLTP to Cloud TP: The Third Era of Transaction Processing Aims to the Cloud”), as well as to migrate existing business applications from their own virtualized servers in internal data centers into the cloud, without changes. Cloud IaaS is best used to enable new IT capabilities, but it has become a reasonable alternative to an internal data center.

Read the full article / reproduced from Gartner.

Top Programming Languages - IEEE Spectrum

The IEEE Spectrum Top Programming Languages app synthesizes 12 metrics from 10 sources to arrive at an overall ranking of language popularity. The sources cover contexts that include social chatter, open-source code production, and job postings. Read on to learn more about the languages we track and each of the data sources we used, what it measures, and how we measured it.

Read the article at IEEE Spectrum

Interactive: The Top Programming Languages - The ranking is calculated using 12 weighted data sources. Click a data source to toggle its inclusion in the ranking and drag its slider to re-weight it.

This app ranks the popularity of dozens of programming languages. You can filter them by listing only those most relevant to particular sectors, such as web or embedded programming. Rankings are created by weighting and combining 12 metrics from 10 sources. We offer preset weightings for those interested in what’s trending or most looked for by employers, or you can take complete control and create your own custom ranking by adjusting each metric’s weighting yourself. (Read about our method and sources)
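The weighting scheme the app describes amounts to a normalized weighted sum of metric scores per language. The sketch below uses entirely made-up scores and weights, purely to illustrate the arithmetic of adjustable weightings:

```python
# Hypothetical metric scores (0-100) for three languages across three
# kinds of sources; the weights are the adjustable part.
scores = {
    "Java":   {"social": 80.0, "open_source": 90.0, "jobs": 95.0},
    "Python": {"social": 85.0, "open_source": 95.0, "jobs": 80.0},
    "R":      {"social": 60.0, "open_source": 70.0, "jobs": 55.0},
}
weights = {"social": 0.2, "open_source": 0.3, "jobs": 0.5}

def rank(scores, weights):
    """Combine each language's metrics into one weighted score and sort."""
    total = sum(weights.values())
    combined = {
        lang: sum(weights[m] * v for m, v in metrics.items()) / total
        for lang, metrics in scores.items()
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank(scores, weights)
```

Setting a weight to zero is equivalent to toggling that data source off; dragging a slider corresponds to changing its value before re-ranking.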

Read the original / reproduced from IEEE Spectrum

What will the future of Cloud Computing look like ? IEEE Cloud Computing

IEEE Cloud Computing

Inaugural issue of IEEE Cloud Computing Magazine

The IEEE Cloud Computing Initiative (CCI), launched in April 2011, has picked up momentum since it received significant funding from the IEEE New Initiative Committee. Several products and services that have been in the works for months are now being introduced, including a website, conferences, continuing education courses, publications, standards, and a platform for testing cloud computing applications. The initiative is the first broad-based collaborative project for the cloud to be introduced by a global professional association.

Reproduced and/or syndicated content. All content and images are copyright the respective owners.

© (all) respective content owner(s)