Bug Programming

The Most Copied StackOverflow Java Code Snippet Contains a Bug (zdnet.com) 71

The admission comes from the author of the snippet itself, Andreas Lundblad, a Java developer at Palantir, and one of the highest-ranked contributors to StackOverflow, a Q&A website for programming-related topics. From a report: An academic paper [PDF] published in 2018 identified a code snippet Lundblad posted on the site as the most copied Java code taken from StackOverflow and then re-used in open source projects. The code snippet was provided as an answer to a StackOverflow question posted in September 2010. The code snippet printed byte counts (123,456,789 bytes) in a human-readable format, like 123.5 MB. Academics found that this code had been copied and embedded in more than 6,000 GitHub Java projects, more than any other StackOverflow Java snippet. In a blog post published last week, Lundblad said that the code had a flaw as it incorrectly converted byte counts into human-readable formats. Lundblad said he revisited the code after learning of the academic paper and its results. He looked at the code again and published a corrected version on his blog.

  • by PeeAitchPee ( 712652 ) on Thursday December 05, 2019 @06:17PM (#59489334)

    while (true)
    {
    status = GetRadarInfo();
    if (status = 1)
    LaunchMissiles();
    }

  • by impaledsunset ( 1337701 ) on Thursday December 05, 2019 @06:39PM (#59489384)

    Of course, outputting 1000.0 MB and 1.0 GB for different values of bytes is an issue, but fixing it doesn't make the choice of threshold for changing units any more sensible. The snippet still prints different sizes with different precision: 981.3 MB has 4 digits of precision, whereas 1234 MB will be printed as 1.2 GB, reducing the number to only 2. That's far more serious than whether, cosmetically, you get 1000.0 MB for some weird corner case.

    • They both contain four digits of total precision. You're referring to discrete fractional precision. It is possible to maintain a static fractional precision but the width in characters excluding the decimal point will then change depending upon magnitude. The only way to display a number with a static character width is to count whole precision including all digits, not counting fractional digits discretely.

      If the aim is to limit the width in characters displayed, scientific notation could potentially prov

      • The number 1.2 GB needs to be viewed as 001.2 GB in this case. That's obviously how the author viewed it as each magnitude has 999.9 as its maximum leaving only space for a single fractional digit. Alternatively you could print 1234 MB == 1.234 GB and 12345 MB = 12.35 GB, operating in steps of ten rather than thousand, but in my experience this is a rare choice.

        My point is it is not definitively something you would call a "bug". It may very well be intended by the author. A bug in a definite sense would be
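The "steps of ten" alternative mentioned above, where every value keeps four significant digits, could be sketched roughly as follows. The class and helper names are hypothetical, and the sketch assumes decimal units (1 GB = 1000 MB) and a whole number of megabytes as input.

```java
import java.math.BigDecimal;
import java.math.MathContext;

class SigDigitsSketch {
    // Hypothetical helper: convert whole megabytes to gigabytes (1 GB = 1000 MB)
    // and round to four significant digits (MathContext defaults to HALF_UP).
    static String mbToGb(long megabytes) {
        BigDecimal gb = BigDecimal.valueOf(megabytes).movePointLeft(3);
        return gb.round(new MathContext(4)).toPlainString() + " GB";
    }

    public static void main(String[] args) {
        System.out.println(mbToGb(1234L));  // 1.234 GB
        System.out.println(mbToGb(12345L)); // 12.35 GB
    }
}
```

The number of fractional digits shrinks as the magnitude grows, which is the trade-off the parent comments are arguing about: constant significant digits versus a constant number of decimal places.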

        • 1024MB = 1GiB

        • by tlhIngan ( 30335 )

          And I hate both of them because it hides progress when it's used when there's a transfer rate involved.

          Whether you give 3 significant figures or 4, sometimes it isn't enough. Let's say you're transferring a 1GB file at 100k/sec (we'll stick with base 10). It's fine when using 3 digits until you get to 100MB or so, then you'll see updates every 10 seconds when 1MB is transferred and ticks over 100MB to 101MB, whereas from 0-99.9MB, it updated frequently.

          If you deal with larger files, you can't really tell if

    • by AmiMoJo ( 196126 )

      It's rounded to 1 decimal place in all instances. The question doesn't specify the format beyond "human readable", giving two examples both with no decimals.

      A fixed number of decimal places is easier to read and compare and can be arranged to line up nicely with the decimal points all in the same column. With a font that has equal-width digits and right-justified text, that is automatic.

      The usual screw-up is to strip .0 so that it ruins this nice formatting. The answer as given kinda does that because it do

  • by Kobun ( 668169 ) on Thursday December 05, 2019 @06:40PM (#59489388)
    Copy Pasta code is a bad idea, but Stack tries hard to give itself a veneer of professionalism. However, they've recently proved this hugely false as well by defaming a long-time volunteer in very casual fashion. It's a bad idea to participate there all around, especially under your real name. https://www.gofundme.com/f/sto... [gofundme.com] See also: https://m.slashdot.org/story/3... [slashdot.org]
    • They both have four digits of total precision. You're referring to discrete fractional precision. It is possible to maintain a static fractional precision but the width in characters excluding the decimal point will then change depending upon magnitude. The only way to display a number with a static character width is to count whole precision including all digits, not counting fractional digits discretely.

      If the aim is to limit the width in characters displayed, scientific notation could potentially provide

      • Not sure how I managed to reply to the wrong post. Web 2.0 is great!
        • Is your web browser written using Java code copy/pasted from StackOverflow?
          • Yes sadly, I'm using Chrome. Normally I run with scripts disabled completely, but posting to slashdot2.0 isn't possible without JS running and doing crazy stuff. That said, it's ultimately my fault since the script refreshed the page and I didn't re-read the post, rather clicked in the same place without moving the mouse and assumed it was the same post. The "not sure" part is... I have no idea how a post can jump that far vertically just by being 2 points vs. 3 points... It might have been multiple posts g

    • by johannesg ( 664142 ) on Friday December 06, 2019 @03:28AM (#59490378)

      Stack Exchange is a vile cesspit you'd do best to steer clear of. I stopped answering questions there after one of my answers was voted to something like -50, and the exact same answer, word by word identical, was voted to +100 and accepted.

      The people on that site are not rational adults. They are children, playing for a high score, and it's not about answering questions but about scoring. I don't want to play that game, so I left.

      • by jeremyp ( 130771 )

        Maybe you shouldn't have copy-pasted the other answer.

      • That's pretty much how Reddit works nowadays. Someone posts something popular, it gets deleted by a mod then reposted under one of the mod's alts. Just a huge pissing contest for useless internet points.

      • > They are children, playing for a high score, and it's not about answering questions

        Yes, gamification goes off the rails when the game becomes more important than the original purpose. The best incentive systems can thrive anyway, but SE isn't there yet.

      • by AmiMoJo ( 196126 )

        That's a common tactic. Someone posts an answer which is at score 0. Before it gets voted up you vote it down to -1 and copy/paste it into your own answer. Ideally you want a sock puppet to then vote your answer up to +1, but often it's not necessary as other useful idiots do it for you.

        You can sometimes report what happened but even if the copied answer gets deleted it rarely helps you.

    • Looking at this particular example, the versions on Stack Overflow are probably all better than what I would have come up with if I had implemented the function myself. Ideally these would have been put into a package somewhere so that, as bugs are fixed, the code can be updated. If somebody is looking on Stack Overflow for a piece of code, it's because they would find writing such code themselves difficult. Even people who are very strong in one area aren't experts in everything. And, in this case, do th
  • It is some output formatting, meant for human eyes. Very unlikely it is used for any further computation. The bug is rounding at 1 MB.
  • The bug is in the rounding. The code uses log as a shortcut, but log only works on floats. When converting the integer input to a float for computation, some bits are lost, and values very close to powers of 1000 round the wrong direction.

    • When converting the integer input to a float for computation, some bits are lost

      What? That's not how it works, until you leave a pretty wide range (especially with doubles, which, e.g., Lua exploited greatly).

      • I looked at the actual article to check this, and there are actually two separate bugs that they discuss.

        The first one is some kind of nitpicky thing where 999,999 bytes shows as 1000.0KB instead of 1.0MB. Still mathematically correct, but they seem to have a problem with it.

        The second, more "wrong" bug shows up at scales of 1000PB, which is an integer that does exceed the 53-bit precision of a double.

        • Yes, and the joint probability of having a >1000 PB file that straddles a problematic size seems minuscule. It may be a problem in the future but probably not for quite some time.
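The two problems described in this sub-thread can be reproduced with a small sketch. The class and method names are hypothetical, and `logBased` is a simplified, SI-only reconstruction of a log-based formatter rather than the snippet from the answer verbatim; it assumes non-negative input and uses the "kB" spelling for the unit.

```java
import java.util.Locale;

class LogByteFormatSketch {
    // Hypothetical reconstruction: pick the unit with a logarithm, then format
    // with one decimal place. The unit is chosen BEFORE rounding happens.
    static String logBased(long bytes) {
        if (bytes < 1000) return bytes + " B";
        int exp = (int) (Math.log(bytes) / Math.log(1000));
        char prefix = "kMGTPE".charAt(exp - 1);
        return String.format(Locale.ROOT, "%.1f %cB", bytes / Math.pow(1000, exp), prefix);
    }

    // True when the long survives a round trip through double unchanged.
    static boolean survivesDouble(long value) {
        return (long) (double) value == value;
    }

    public static void main(String[] args) {
        // 999,999 stays in the kilo range, then 999.999 rounds up to 1000.0.
        System.out.println(logBased(999_999L));             // 1000.0 kB
        // A double has a 53-bit significand, so 2^53 + 1 cannot be represented.
        System.out.println(survivesDouble((1L << 53) + 1)); // false
    }
}
```

The first line of output is the cosmetic case (the unit is selected from the unrounded value); the second shows why very large `long` inputs can lose low-order bits once they are handed to `Math.log`.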
  • by Martin S. ( 98249 ) on Thursday December 05, 2019 @07:04PM (#59489442) Journal

    This incident reveals the real issue is the rise of cargo cult [wikipedia.org] programming that SO encourages. The copying of code from SO and elsewhere without actually understanding what it really does or how it functions. You often see demands to just give me the code/solution all the time, you see the wannabes regurgitate the same crap code again and again to similar questions. SO has just gone to shit over the years, full of self entitled arseholes who have gamified SO.

    In the past I contributed answers to SO and in its early days used it to help understand some usage idioms; however, back in the day I always reproduced the requirement against my previously created JUnit test case, typically resorting to SO to help understand why my code wasn't working and what my flawed assumption was, not for cut-and-paste code. Potential bugs are the least troublesome aspect of that; the risk of copyright infringement is a far bigger threat to many projects.

    • The copying of code from SO and elsewhere without actually understanding what it really does or how it functions.
      And what has that to do with cargo cult?

      SO has just gone to shit over the years, full of self entitled arseholes who have gamified SO.
      That is actually true. I got an answer once like "Who would I trust more, a guy with a gazillion of reputation or you with close to zero?". That guy's answer was not only wrong but nonsense, and I got "voted down" into oblivion :D

      • Cargo cults saw the strangers come, build runways, then magic metal birds would land and give them things. When the strangers packed up and left, the magic metal birds stopped coming and giving gifts. The cargo cults started building and maintaining runways and acting out the "rites" the strangers performed to get the magic metal birds to come back and give gifts again.

        The cults simply copied what they had seen and didn't actually understand what was being done and how the process actually worked in hopes

    • I don't need to understand the code to use it. Do I really have to read the Kernel code in order to make an fopen() call? The whole point of abstraction is that you have a function that is implemented and tested (often better than you could do yourself in a reasonable period of time) and you utilize it. Have you ever looked at optimized data compression code? Yes I *could* understand it if I wanted to spend days. But why would I need to do that just to gzip up some data? What if I run gzip from the co
  • ... should go get a McJob and leave coding to the grown ups. I find it incredible anyone had to look up and then cut n paste something so utterly trivial.

    God help us if these clowns are working in safety critical systems one day.

    • Like the day someone unpublished the left-pad NodeJS module, and that whole dumpster-f^W^W ecosystem imploded [theregister.co.uk]?

    • by jetkust ( 596906 ) on Thursday December 05, 2019 @07:42PM (#59489538)
      Kind of an overreaction. It's unlikely they had to cut and paste. But if something is already written, use it and focus on something more important. No different than using packages, SDKs, or libraries. There could be bugs in any of them. And not only that, there could have been bugs in their own code if they wrote it from scratch. And chances are this bug is insignificant compared to the dozens or hundreds of other bugs they're already dealing with in code they wrote themselves. Code has bugs. Nothing alarming about it.
      • There is something you are missing. This is a simple bit of coding. The fact they have copied and pasted it implies they went looking for it. Why are they looking for it in the first place? If they went looking for a solution to this simple problem, how much of their code is copy/paste from Google searches?

      • by AmiMoJo ( 196126 )

        The difference with libraries, SDKs and packages is that they usually have some kind of update mechanism. If there is a bug it gets fixed and everyone using them gets the patch.

        Stack Overflow has no update mechanism, no notification mechanism and all the people voting the answer up don't bother to check that it actually works first.

        It's a crap answer anyway, the logarithms it uses are almost certainly going to be slower than a small loop and are highly resistant to compiler optimization. The binary unit ver
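For comparison, here is a minimal sketch of the small-loop alternative mentioned above, assuming SI units and non-negative input only. The class name, method name, and the 999_950 threshold are illustrative choices, not code taken from the answer or the blog post.

```java
import java.util.Locale;

class LoopByteFormatSketch {
    // Hypothetical helper: integer division in a loop instead of logarithms.
    static String formatSi(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("negative sizes are out of scope for this sketch");
        }
        if (bytes < 1000) return bytes + " B";
        String prefixes = "kMGTPE";
        int i = 0;
        long value = bytes;
        // At or above 999_950 the value divided by 1000 is 999.95 or more, which
        // one-decimal rounding can push up to 1000.0, so move to the next unit first.
        while (value >= 999_950) {
            value /= 1000;
            i++;
        }
        return String.format(Locale.ROOT, "%.1f %cB", value / 1000.0, prefixes.charAt(i));
    }

    public static void main(String[] args) {
        System.out.println(formatSi(123_456_789L)); // 123.5 MB
        System.out.println(formatSi(999_949L));     // 999.9 kB
        System.out.println(formatSi(999_999L));     // 1.0 MB
    }
}
```

The loop runs at most five times for any `long`, and the unit decision is made on integers, so the choice of unit no longer depends on floating-point logarithms. Whether it is actually faster would need measuring.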

    • by thegarbz ( 1787294 ) on Thursday December 05, 2019 @07:43PM (#59489542)

      God help us if these clowns are working in safety critical systems one day.

      This comment of yours is incredibly ironic since one of the main elements of coding for safety critical systems is that someone else writes code once, it gets audited, and that this code is re-used often and identically everywhere. The most trivial of things are broken down into pre-defined and pre-certified functions and the result is literally copy and pasted.

      Yes it is trivial to do the thing required. That doesn't mean people need to re-invent the wheel every time they come across the problem. Bugs happen, the benefit of publicly posted code is the many eyes makes bugs shallow principle. Unfortunately your self invented code doesn't benefit from this "audit".

      Also worth noting is that based on the blog the bug we're talking about is a) completely benign, and b) actually present in every other code example posted in the thread, so your trivial problem you should be capable of coding yourself or get a McJob seems to not be so trivial as you think.

      • Code that shows up on SO is not peer reviewed in any sane sense of the term.

        It is trivial and the fact that multiple people made the same mistake is simply a condemnation of the quality of the people that post there.
        • Code that shows up on SO is not peer reviewed in any sane sense of the term.

          Sure it is. Just not in the scientific sense of the term. Many answers even have comments on them discussing the answer in public eye.

          It is trivial and the fact that multiple people made the same mistake is simply a condemnation of the quality of the people that post there.

          Not multiple people, *all* people made the same mistake despite all producing different solutions to the problem. So forgive me but I am convinced that you haven't at all thought about the problem enough to not make the same mistake (though now that the bug is being discussed you likely won't).

      • by jeremyp ( 130771 )

        It's not benign. If it had been programmed according to spec, the string would always have a maximum length of seven characters. In the edge case, the string can be eight characters long. There's a potential to cause a buffer overflow.

    • Why would you write code that is already written?

      • Because it is a trivial problem and the "solution" is buggy?

        If I didn't already know you are a Python end user and not a programmer, this post would tell me.
    • Any monkey could code this. Several people did code solutions and post them to Stack Overflow. And each solution was wrong-ish in various ways. Showing that anyone can code it - wrong. I'm sure you can code an implementation quickly, and I'm sure that if you don't read the article and the comments on other people's implementations, your code will have problems in certain cases.

      If you're about to code something that you know a thousand other people have already coded before, it's not only faster but smarter to see ho

      • As an example of what I mean, many times I've copied code written by a particular programmer who is not as experienced as I am. I know for sure I could from scratch write the code he wrote, but instead I Google for his code and copy what he did because I know it's had peer review and testing.

        I know for sure that I *could* rewrite anything he wrote, because he's me. I copy my own code from several years ago instead of writing it fresh because the code I wrote 5 years ago has now had 5 years of real-world te

        • As an example of what I mean, many times I've copied code written by a particular programmer who is not as experienced as I am. I know for sure I could from scratch write the code he wrote, but instead I Google for his code and copy what he did because I know it's had peer review and testing.

          I know for sure that I *could* rewrite anything he wrote, because he's me. I copy my own code from several years ago instead of writing it fresh because the code I wrote 5 years ago has now had 5 years of real-world testing. Writing it again is an opportunity to create a new bug, with no upside.

          If it is your code, then why don't you have a library of useful code snippets on development machine instead of searching via Google? Also, this is not a representative case.

          • > If it is your code, then why don't you have a library of useful code snippets on development machine

            That might be somewhat useful. Of course, whichever company paid me to come in and write their code for them owns the code, so I'm mostly only going to re-use code that a) has been released publicly or b) was written for a company I "owned".* A lot of the code I've written is in the latter category, and I do have a local copy of that.

            For the open-source code, I might as well grab it from CPAN or GitHub o

    • If you read the blog post by the author of the snippet, the "bug" isn't what I'd term utterly trivial.

      Read here: https://programming.guide/worl... [programming.guide]

      Is it the most complex problem in the world? No, of course not. And perhaps if the author didn't want to avoid loops the bug would never have happened. The author even admits he would not use the snippet in production code.

      From the aforementioned link:

      Note that this started out as a challenge to avoid loops and excessive branching. After ironing out all corner cases the code is even less readable than the original version. Personally I would not copy this snippet into production code.

  • Another approach is to do all the processing in text. Not sure about the efficiency though, but it allows any numeric type to be used. Here's an example: https://stackoverflow.com/questions/808104/engineering-notation-in-c/48616532#48616532 [stackoverflow.com]
  • Highly unlikely that an employee of Palantir would deliberately introduce errors into code contributed to open sources. Given their business and all.

  • The bug (Score:5, Informative)

    by Psychotria ( 953670 ) on Thursday December 05, 2019 @09:19PM (#59489708)

    You'd think the article linked to would at least mention what the bug was. Instead it links to a research paper about SO attribution which doesn't mention the snippet has a bug at all, just that it's the most copied. The following blog post (by the author of the snippet) explains the bug and discusses how it was fixed. Quite interesting really.

    https://programming.guide/worl... [programming.guide]

  • Ah, "Cut'N'Paste Techies"!

    I grew up coding on 8bit micros in the 1980s, you learned on your own, from a friend or from maybe one or two books and mags you could actually find. I loved typing in long-ass listings from mags that never worked because you always had to fix the listings, that gave you some great experience in debugging BASIC code and sometimes Z80 assembler. My mates and I would be just a handful of the nerdy kids in the school computer room at lunchtimes and after school, chatting to teachers,

    • Both me and my stepfather made a good living out of rewriting crap code generated by people who got into IT because it's 'where the money was'. I'm sure there's plenty of other old school programmers that look upon today's slacker generation, as we look at the mess they collectively created for themselves, and know there was once a better way.
  • https://programming.guide/worl... [programming.guide]

    The "final", "fixed", version:
    public static strictfp String humanReadableByteCount(long bytes, boolean si) {
    int unit = si ? 1000 : 1024;
    long absBytes = bytes == Long.MIN_VALUE ? Long.MAX_VALUE : Math.abs(bytes);
    if (absBytes < unit) return bytes + " B";
    int exp = (int) (Math.log(absBytes) / Math.log(unit));
    long th = (long) (Math.po
